Issue Details (XML | Word | Printable)

Key: FILEUPLOAD-101
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Unassigned
Reporter: Oleg Kalnichevski
Votes: 1
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Commons FileUpload

[FileUpload] does not take 'charset' parameter of the 'Content-Type' header into consideration

Created: 16/Jun/03 09:51 PM   Updated: 30/Oct/06 09:53 PM
Return to search
Component/s: None
Affects Version/s: 1.0 Beta 2
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Text File fileupload.patch 2004-03-10 02:08 AM Oleg Kalnichevski 23 kB
Text File fileupload.patch 2003-09-14 06:39 PM Oleg Kalnichevski 24 kB
Text File fileupload.patch 2003-07-31 03:47 AM Oleg Kalnichevski 23 kB
Text File fileupload.patch 2003-06-20 10:12 PM Oleg Kalnichevski 23 kB
Environment:
Operating System: All
Platform: All

Bugzilla Id: 20813
Resolution Date: 30/Oct/06 09:53 PM


 Description  « Hide
Each individual part in the 'multipart/form-data' encoded requests may have its
own content type definition. Within the content type definition the HTTP agent
may provide a character encoding to be used when processing the content of the
part in question. Currently FileUpload does not adequately support custom
character encoding of individual parts in the 'multipart/form-data' encoded
requests.

– RFC1867: Quote ------------

7. Registration of multipart/form-data

The media-type multipart/form-data follows the rules of all multipart
MIME data streams as outlined in RFC 1521

– RFC1867: End of quote -----

For more details refer to http://www.ietf.org/rfc/rfc1867.txt

– RFC1521: Quote ------------

7.1 The Text Content-Type

The text Content-Type is intended for sending material which is
principally textual in form. It is the default Content-Type. A
"charset" parameter may be used to indicate the character set of the
body text for some text subtypes, notably including the primary
subtype, "text/plain", which indicates plain (unformatted) text. The
default Content-Type for Internet mail is "text/plain; charset=us-
ascii".

– RFC1521: End of quote -----

For more details refer to http://www.ietf.org/rfc/rfc1521.txt

Of course, the agent does not have to set 'charset' parameter, but if it does,
the parameter must be taken into account.

Right now, method DefaultFileItem#getString() produces erroneous result (at
least in my humble opinion)



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Oleg Kalnichevski added a comment - 18/Jun/03 02:36 PM
I respectfully disagree. In my opinion this is a fairly serious bug, not a
feature request. If you like I could come up with a patch to address this issue.
In HttpClient project we confront similar problems quite often and have a lot of
code I can draw upon. Just let me know.

Oleg


Martin Cooper added a comment - 20/Jun/03 01:16 PM
I understand that custom crafted requests, such as those generated using
HttpClient, may encounter this type of issue. However, I have seen no evidence
that the most commonly used clients (i.e. the most popular browsers) pay any
attention to the 'charset' parameter. Therefore I maintain that this should be
considered as an issue to be considered after a FileUpload 1.0 release.

Oleg Kalnichevski added a comment - 20/Jun/03 02:58 PM
Martin,

I understand your position. But before this issue is put on back burner, allow
me to say that I find standards compliance of mainstream browsers in several
areas simply appalling. 'All big boys do it' is a weak argument in my opinion.
We can do better.

Anyways, I have the patch almost ready. The patch will retain complete API
compatibility and will include a reasonable number of test cases. I'll attach it
to this bug report by Monday next week the latest and leave it up to FileUpload
maintainers to decide what to do with it.

Once release 1.0 is branched out I would be prepared to contribute additional
patches intended to improve FileUpload standards compliance record. These
changes will most certainly require some API revision. Feel free to knock on my
door if there's some interest.

Cheers

Oleg


Oleg Kalnichevski added a comment - 20/Jun/03 10:12 PM
Created an attachment (id=6914)
Patch (take 1)

Martin Cooper added a comment - 21/Jun/03 11:59 AM
Oleg, I understand your position too, but please realise that FileUpload is at
the Release Candidate stage now. I took a look at the patch you attached, and
that is WAY too much change to be introducing into a Release Candidate.

What I would like to do is to release FileUpload 1.0 Final as is, very soon,
and then we can work on getting your (take >1?) patch in, and release that as a
1.1 in a short timeframe. Beyond that, I would certainly be interested in your
contributions to improve FileUpload, and if there are likely to be significant
API changes, then perhaps we can target that for a 2.0 release.

Getting a FileUpload 1.0 Final release out the door very soon is really, really
important, because there are so many other projects depending on it now, and
waiting for such a release before they themselves can release. Those projects
include Tomcat, Turbine, Tapestry and Struts, to name only the Jakarta projects
I know about, and they have no earlier release of FileUpload to fall back on.


Oleg Kalnichevski added a comment - 21/Jun/03 10:02 PM
Fair enough. Good luck with the release. Cheers.

Oleg


Oleg Kalnichevski added a comment - 31/Jul/03 03:47 AM
Created an attachment (id=7593)
Patch (take 2)

Oleg Kalnichevski added a comment - 31/Jul/03 03:51 AM
The parameter parser has been optimized to eliminate unnecessary object
instantiation. Parsing should be a bit faster now.

Oleg


Christoffer Hammarström added a comment - 12/Sep/03 08:55 PM
I believe the default charset to be used is ISO-8859-1 (compatible with
US-ASCII) if none is given.

From RFC2046, MIME Part Two: Media Types:

– RFC2046: Quote ------------

4.1.2. Charset Parameter

A critical parameter that may be specified in the Content-Type field
for "text/plain" data is the character set. This is specified with a
"charset" parameter, as in:

Content-type: text/plain; charset=iso-8859-1

Unlike some other parameter values, the values of the charset
parameter are NOT case sensitive. The default character set, which
must be assumed in the absence of a charset parameter, is US-ASCII.

– RFC2046: End of quote -----

From RFC2616, HTTP/1.1:

– RFC2616: Quote ------------

The "charset" parameter is used with some media types to define the
character set (section 3.4) of the data. When no explicit charset
parameter is provided by the sender, media subtypes of the "text"
type are defined to have a default charset value of "ISO-8859-1" when
received via HTTP. Data in character sets other than "ISO-8859-1" or
its subsets MUST be labeled with an appropriate charset value. See
section 3.4.1 for compatibility problems.

– RFC2616: End of quote -----


Oleg Kalnichevski added a comment - 12/Sep/03 09:54 PM
One may argue that RFC2046 should take precedence in this particular case, but I
agree that ISO-8859-1 is most likely to be expected by the users as a default
HTTP content encoding

Anyways, the patch is wrong in using default system encoding when no charset
explicitly set. I'll tweak the patch and submit an updated version this weekend.

Oleg


Christoffer Hammarström added a comment - 12/Sep/03 10:28 PM
Oleg: While it's probably not required just for the charset parameter of the
content-type header, you might want to add support to your ParameterParser for
escaped characters inside the quoted parameter-value:

Header: header-value; parameter="value with \"quotes\" inside"

The javadoc for "ParameterParser.parseQuotedToken(final char[] terminators)"
states "Special characters in quoted tokens are escaped.", which doesn't seem to
be the case. Forgive me if i'm just being blind and stupid.

The syntax of the content-type header in RFC2045 (section 5.1) is based on
RFC822, which defines quoted-string:

– RFC822: Quote ------------

qtext = <any CHAR excepting <">, ; => may be folded
"\" & CR, and including
linear-white-space>
quoted-pair = "\" CHAR ; may quote any char
quoted-string = <"> *(qtext/quoted-pair) <">; Regular qtext or
; quoted chars.

– RFC822: End of quote -----


Oleg Kalnichevski added a comment - 12/Sep/03 10:53 PM
Fair enough. I'll do so.

Oleg


Oleg Kalnichevski added a comment - 14/Sep/03 06:39 PM
Created an attachment (id=8205)
Patch (take 3)

Oleg Kalnichevski added a comment - 14/Sep/03 06:43 PM
The new patch should
  • properly handle escaped chars inside quoted paramerter values
  • apply correct default encoding (ISO-8859-1) when charset is not explicitly set
    by the sender

Let me know what you think

Cheers

Oleg


Olivier Vit added a comment - 23/Sep/03 09:57 PM
Hi

We did a local integration of the patch provided by Oleg
http://issues.apache.org/bugzilla/showattachment.cgi?attach_id=8205 and is
successfully solved or issues:

we wanted to allow iso latin characters to be allowed in a form mixing text
inputs and file uploads. until applying the patch, we got only question marks in
place of each character specific to the iso latin charset, which is REALLY
unacceptable for european people.

Then please consider incorporating it really soon in any upcoming release !!


Christoffer Hammarström added a comment - 27/Feb/04 07:16 PM
Is there any progress being made on this?

Oleg Kalnichevski added a comment - 27/Feb/04 09:08 PM
It appears none of the present maintainers has got any bandwidth left for
[FileUpload]. I believe I have access permissions to check the patch in myself;
however, I am reluctant to do it without having an explicit OK from someone
overseeing the development of [FileUpload].

Oleg


Olivier Vit added a comment - 27/Feb/04 09:23 PM
Check-in would already make our life easier (avoids to patch the code ourselves,
as we needed to), then please go.
The next real question is when and how can a new release of this package be
delivered ?

Martin Cooper added a comment - 01/Mar/04 03:54 PM
Several comments on this:
  • That this issue has not yet been addressed is not a good reason to assume
    that nothing is happening with FileUpload. I have multiple fixes pending
    verification, as well as some refactoring in progress.
  • In my opinion, there are several other FileUpload issues of higher priority
    that this one. Hence this issue is not receiving top priority attention.
  • I am not willing to commit a patch for this issue without an accompanying
    unit test. The issue, as I understand it, does not arise with current browsers,
    so without a unit test, I would be "driving blind".
  • The fact that you are at "take 3" on a patch for this emphasises the need for
    unit tests, and suggests that the solution is not yet fully worked out.

By the way, I am finding it *really annoying* that comments from Christoffer
in this bug report show up as giant 'mailto' links, so Christoffer, please,
please add your comments as plain text!


Christoffer Hammarström added a comment - 01/Mar/04 04:54 PM
Martin:
  • Oleg's patch does contain unit-tests. I don't know whether they are incomplete
    or covering everything, though. Maybe you could comment on them?
  • I always have written my comments as plain text into the "Additional Comments"
    box of bugzilla. My guess is that your browser is taking issue with the "ö" in
    my last name. Do you think that could be the problem? I myself am using Mozilla
    1.6, with no problems.

I am looking forward to the next version!


Christoffer Hammarström added a comment - 01/Mar/04 04:59 PM
Martin:

It seems that bugzilla has switched to using UTF-8 since i first registered my
name, making my ISO-8859-1 representation of my name mangled in UTF-8. I've
fixed my name in bugzilla now, and i hope that'll solve the problem.


Olivier Vit added a comment - 01/Mar/04 06:20 PM
Hi
I confirm that we had the issue (accented characters are not preserved in forms
mixing data input and file uploads) with data in the iso-latin charset and using
any recent web browser like MSIE 6, Mozilla 1.4, Konqueror, etc..., on either MS
Win XP, Linux 2.4, Win 2000

Then please revise your evaluation of the priority for this issue
Regards

(some server details : Java HotSpot(TM) Client VM (build 1.4.1_02-b06, mixed
mode), tomcat 4.1.24, Linux 2.4)


Oleg Kalnichevski added a comment - 08/Mar/04 01:01 AM
> * I am not willing to commit a patch for this issue without an accompanying
> unit test. The issue, as I understand it, does not arise with current browsers,
> so without a unit test, I would be "driving blind".

Martin,
I doubt you had taken even a cursory look at the patch. Had you done so, you
might have noticed that there had been accompanying test cases since the very
first submission. If you think that some test cases are missing or certain
functionality is not adequately covered with the existing test cases, I'll be
more than willing to provide the missing bits

> * The fact that you are at "take 3" on a patch for this emphasises the need for
> unit tests, and suggests that the solution is not yet fully worked out.

This is a very interesting observation. Actually the patch has been ignored for
so long (over 8 months) that I found enough time to do some object allocation
and performance optimization on the parameter parser, hence take 2. Besides,
this patch is pretty much a code donation from the [HttpClient] project where
the parameter parser has been around ever since the initial patch submission.

If you see some flaws in implementation, feel free to point them out, and I'll
happily do my best to correct them. Feel free to reject the patch altogether if
you see fundamental design flaws, but try to be a bit more specific than "the
solution is not yet fully worked out". I am a big boy, rejection of this patch
will not make my world end.

Moreover, if you find attempts to make [FileUpload] a little friendlier to
non-English speaking folks "annoying", do say so. I'll never bother you again

Kind regards,

Oleg


Martin Cooper added a comment - 08/Mar/04 08:48 AM
OK, I've looked at "take 3", and if you're willing to make a few changes, I'll
commit a "take 4".

1) The new classes need to have the 2.0 license applied, since the rest of
FileUpload has been updated to the new license.

2) All of the code needs to abide by the Turbine coding guidelines. (Checkstyle
will verify this for you.) The mods to existing classes are OK in this regard,
but the new classes appear to use the Sun coding guidelines.

3) The Javadocs for all methods need to include documentation for the
parameters and return values.


Oleg Kalnichevski added a comment - 08/Mar/04 05:03 PM
Martin,
I'll do so. Just give me a couple of days.

Oleg


Oleg Kalnichevski added a comment - 10/Mar/04 02:08 AM
Created an attachment (id=10725)
Patch (take 4)

Oleg Kalnichevski added a comment - 10/Mar/04 02:19 AM
I did my best to address your concerns. Let me know if there's anything else
left to be done.

One last thing from my side: I believe it would make sense extending FileItem
interface to provide getCharSet method, as the charset is a generic property of
all multipart items. The DefaultFileItem class provides this method already

Regards,

Oleg


Martin Cooper added a comment - 12/Mar/04 03:36 PM
Fixed in the 20040312 nightly build.

I've applied "take 4", with a few minor changes to make Checkstyle a little
happier.


Ralf Hauser added a comment - 05/Apr/05 03:00 AM
see also see also Bug 34291 for the case where the browsers still do not declare the
charset, but you know it is not being the hard-coded ISO-8859-1 for the case where the browsers still do not declare the
charset, but you know it is not being the hard-coded ISO-8859-1

Ralf Hauser added a comment - 05/Apr/05 02:04 PM
see related bug in firefox https://bugzilla.mozilla.org/show_bug.cgi?id=289060
and RFE see related bug in firefox https://bugzilla.mozilla.org/show_bug.cgi?id=289060
and RFE Bug 34297 (to gracefully handle the transition in struts until all
browsers do use the charset) (to gracefully handle the transition in struts until all
browsers do use the charset)

Ralf Hauser added a comment - 14/Apr/05 03:22 PM
a patch that makes struts benefit from this endeavour a patch that makes struts benefit from this endeavour Bug 34445