Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Versions: 0.3, 0.4
- Component/s: None
- Labels: None
Description
The Header#writeTo method uses the content charset instead of the US-ASCII charset required by RFC 822. The same problem exists in Multipart#writeTo.
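In other words, the fix amounts to serializing header fields with US-ASCII rather than the body's charset. A minimal illustration of that idea (hypothetical code, not the actual mime4j implementation):

import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch: header fields are serialized using US-ASCII,
// regardless of the charset used for the message body.
public class AsciiHeaderWriter {
    static void writeField(String fieldLine, OutputStream out) throws IOException {
        out.write(fieldLine.getBytes("US-ASCII")); // unmappable characters become '?'
        out.write('\r');
        out.write('\n'); // RFC 822 lines are terminated by CRLF
    }
}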
Attachments
- mimeheader.patch (16 kB), attached by Oleg Kalnichevski
Issue Links
- relates to: MIME4J-112 Define Limits Of Round Tripping In Mime4J (Open)
Activity
The code fragment Robert referred to in his comment from 10/Feb/08 seems to have gone with the "Second patch from MIM4J-5" (http://svn.apache.org/viewvc?view=rev&revision=674206).
Are there still any other round trip problems or can this issue be closed?
I think the title of this issue is too generic: can anyone update it to reflect the specific problem and the current mime4j status?
The fact that writeTo produces sequences violating RFC 822 is also discussed in MIME4J-60 (even if MIME4J-60 should probably be split into parsing issues and DOM writeTo issues).
The round-trip encoding still needs to be resolved, but this is likely to be a little delicate, so I propose leaving it until 0.5.
Do we need another mode to support naive conversion? (char -> byte)
Applied revised patch. Many thanks.
Plan to leave this issue open to allow discussion of round tripping.
I'm going to apply this patch.
Note that I suspect that accurate round-tripping is currently broken. I think that MimeTokenStream uses:
int curr = 0;
int prev = 0;
while ((curr = cursor.advance()) != -1) {
    if (curr == '\n' && (prev == '\n' || prev == 0))
        sb.append((char) curr);
    prev = curr == '\r' ? prev : curr;
}
which does a naive (and inefficient) cast conversion. So, to round trip I think that the output would need to perform the reverse operation, scanning each character in the string and casting it to a byte before pushing it into the buffer. This option is not available ATM.
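A minimal sketch of that reverse operation (hypothetical helper, not part of the current API), assuming each char in the decoded string carries one original byte value:

public class RoundTripSketch {
    // Undo the naive (char) cast by keeping only the low 8 bits of every character.
    static byte[] toBytes(String decoded) {
        byte[] raw = new byte[decoded.length()];
        for (int i = 0; i < decoded.length(); i++) {
            raw[i] = (byte) decoded.charAt(i);
        }
        return raw;
    }
}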
Robert
Oleg - RFC2047 defines a custom codec for message headers. Geronimo should have a basic implementation.
> The strictly correct approach would be to create a charset implementation that understood how to encode and decode this syntax.
Robert,
I am not sure that would work, as the encoding is meant to apply to message elements such as header bodies, and not the whole data stream.
> Maybe Geronimo has its own implementation of MailUtility?
Stefano,
Just for the record, Commons Codec provides both Base64 and quoted-printable codecs. Unfortunately, Codec cannot work with streams, but this does not seem to be very relevant in this particular case.
Oleg
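As a rough illustration of the point about Commons Codec (assuming commons-codec on the classpath; these are the content-transfer-encoding codecs, so the RFC 2047 encoded-word wrapper =?charset?encoding?...?= would still have to be added around the result):

import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.net.QuotedPrintableCodec;

public class CodecSketch {
    public static void main(String[] args) throws Exception {
        // Base64-encode the header body bytes in their original charset
        byte[] raw = "résumé".getBytes("ISO-8859-1");
        System.out.println(new String(Base64.encodeBase64(raw), "US-ASCII"));
        // Quoted-printable encoding of the same text
        QuotedPrintableCodec qp = new QuotedPrintableCodec("ISO-8859-1");
        System.out.println(qp.encode("résumé"));
    }
}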
The new version of the patch adds support for different protocol compliance levels:
- STRICT_ERROR: the ASCII charset is used when writing out multipart boundary elements; the use of invalid characters causes a MimeException
- STRICT_IGNORE: the ASCII charset is used when writing out multipart boundary elements; invalid characters are silently ignored (replaced with '?')
- LENIENT: the content charset is used when writing out multipart boundary elements
Lenient mode is used by default.
Please review and let me know what you think.
Oleg
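A rough sketch of what the three compliance levels described above could look like in code (names and signature are hypothetical and need not match the actual patch):

public class ComplianceSketch {

    enum ComplianceMode { STRICT_ERROR, STRICT_IGNORE, LENIENT }

    static byte[] encodeBoundaryElement(String text, String contentCharset, ComplianceMode mode)
            throws java.io.UnsupportedEncodingException {
        if (mode == ComplianceMode.STRICT_ERROR) {
            for (int i = 0; i < text.length(); i++) {
                if (text.charAt(i) > 127) {
                    // the actual patch reports this as a MimeException
                    throw new IllegalArgumentException("non-ASCII character in boundary element");
                }
            }
            return text.getBytes("US-ASCII");
        } else if (mode == ComplianceMode.STRICT_IGNORE) {
            // String.getBytes silently replaces unmappable characters with '?'
            return text.getBytes("US-ASCII");
        } else {
            // LENIENT: fall back to the charset of the content
            return text.getBytes(contentCharset);
        }
    }
}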
Ok, I know you probably don't want to add JavaMail dependencies, but javax.mail.internet.MailUtility contains methods to encode/decode text in the RFC 2047 format as well.
Maybe Geronimo has its own implementation of MailUtility? I spent 10 minutes in the Geronimo repository and on the Geronimo website, but I was unable to find the SVN location of their latest implementation.
Thanks Stefano
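For reference, the standard JavaMail helper for RFC 2047 text is javax.mail.internet.MimeUtility; a minimal sketch (shown only for comparison, since mime4j avoids the JavaMail dependency):

import javax.mail.internet.MimeUtility;

public class EncodedWordSketch {
    public static void main(String[] args) throws Exception {
        // Encode a non-ASCII header body as an RFC 2047 encoded word ("B" = Base64)
        String encoded = MimeUtility.encodeText("café", "UTF-8", "B");
        System.out.println(encoded);                          // e.g. =?UTF-8?B?Y2Fmw6k=?=
        System.out.println(MimeUtility.decodeText(encoded));  // café
    }
}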
RFC2047 covers international encoding in detail. The strictly correct approach would be to create a charset implementation that understood how to encode and decode this syntax. Otherwise round tripping will fail. This would be a reasonable amount of work. It's also a shame that it's not widely supported.
For email, it's important that the original byte sequence is retained to ensure that round tripping works correctly. This probably means some additional work in the parser.
But having talked around this point, IMHO a patch that allows flexibility is important. IMHO it should be OK to commit a patch as outlined by Oleg.
Robert
Robert,
In the HttpComponents / HttpClient project we tend to be lenient about standards when parsing incoming messages but strict when generating outgoing ones. This approach may well not apply to email, though, since messages are often relayed through a chain of MTAs with varying levels of protocol compliance. You know this better than I do. Anyway, I'll try to put together another patch in the coming days.
Oleg
The base64 and quoted-printable encodings IMHO refer to RFC 2047 (http://www.faqs.org/rfcs/rfc2047.html).
The format for the encoded header values is:
=?#charset#?#encoding#?#encoded-text#?=
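For example (illustrative only), a Subject of "Grüße" encoded with the ISO-8859-1 charset and Q encoding becomes:
Subject: =?ISO-8859-1?Q?Gr=FC=DFe?=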
I don't know what client support is like for other headers, but it works for Subjects.
I know of some mailers (MUAs) that recognize this format only when it is used for the WHOLE header value and not for a single encoded word.
Some mailers (MUAs) also accept 8-bit data in the headers and treat it using the same charset declared by the headers. This is not RFC-compliant behaviour, but I think this was very common in East Asian countries in past years, so many MUAs support it to increase compatibility with "badly formatted" emails.
IMHO a "leave as is" option should be made available. Sometimes we want to alter a message by adding some header without altering any other bit (maybe this is your "international-lenient content+IGNORE"?).
"If you like I can implement two modes for Header#writeTo and Multipart#writeTo methods: strict (ASCII) and lenient (content charset). We'll need this feature in HttpClient anyways."
This sounds good to me. Probably 4 modes would be good (but I can patch any you miss):
- standard-strict should be ASCII+ERROR (when a string contains non-ASCII characters, the operation fails)
- standard-lenient would be ASCII+REPLACE (or IGNORE?)
- international-strict would be content+ERROR
- international-lenient would be content+REPLACE (or IGNORE?)
AIUI the current default is international-lenient.
Robert
"as far as the strict interpretation of the MIME standard is concerned only ASCII characters seem permissible"
yes
"and non-ASCII are expected to be escaped using BASE64 or Quoted-Printable encodings."
no - AIUI non-ASCII header values are not expected at all by the specification. I do not know of any way for a client to discover such a transfer encoding.
This patch has considerable potential interoperability impact on email (where international mail clients often interpret the headers using the content encoding).
- robert
The new version of the patch also makes sure the Multipart class uses the ASCII charset when writing preamble, boundary, and epilogue elements to the output stream. Please note this does not affect the encoding of the body parts.
Oleg
Robert, et al
Many browsers indeed use the content charset to encode multipart header values, so we will have to provide a browser-compatible mode in HttpClient. However, as far as the strict interpretation of the MIME standard is concerned, only ASCII characters seem permissible and non-ASCII characters are expected to be escaped using Base64 or Quoted-Printable encodings.
====
field               = field-name ":" [ field-body ] CRLF
field-name          = 1*<any CHAR, excluding CTLs, SPACE, and ":">
field-body          = field-body-contents
                      [CRLF LWSP-char field-body]
field-body-contents =
                     <the ASCII characters making up the field-body, as
                      defined in the following sections, and consisting
                      of combinations of atom, quoted-string, and
                      specials tokens, or else consisting of texts>
====
If you like I can implement two modes for Header#writeTo and Multipart#writeTo methods: strict (ASCII) and lenient (content charset). We'll need this feature in HttpClient anyways.
Oleg
WRT Multipart, it seems wrong to ignore the transfer encoding
Again, hopefully people will jump in with more opinions
I wonder whether this might have some wrinkles
There's RFC822 to consider (as well as 2045)
The header name MUST be US-ASCII. The field body may be composed of US-ASCII characters barring CR and LF.
IIRC practically, an email client is at liberty to interpret these characters as they please. Many choose to interpret them according to the encoding.
AIUI when the field body content contains non-US-ASCII characters, if US-ASCII is forced then they will be written out in a way that is unintelligible to any client. If the content charset is used, then some email clients may be able to interpret them correctly. However, the body will need to be checked for bitwise line breaks. (The output method uses a string, which is inefficient.)
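A small illustration of this trade-off (plain java.lang.String encoding; '?' is the default replacement for characters US-ASCII cannot represent):

public class CharsetTradeoffSketch {
    public static void main(String[] args) throws java.io.UnsupportedEncodingException {
        String body = "Grüße";
        // The content charset keeps the original byte values (47 72 FC DF 65) ...
        byte[] latin1 = body.getBytes("ISO-8859-1");
        // ... while forcing US-ASCII degrades the non-ASCII characters to '?'
        byte[] ascii = body.getBytes("US-ASCII");
        System.out.println(new String(ascii, "US-ASCII")); // prints "Gr??e"
    }
}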
But I'm not an expert - hopefully someone who is will jump in...
At the least, it seems unreasonable not to give the user the option to use US-ASCII or content-type encoding
If you have a particular example in mind where content-encoding produces a bad result, it would be useful if you could explain it.
The patch (attached) should fix the problem. I'll submit a patch for the Multipart as soon as MIME4J-32 is resolved. Please review.
Oleg
This issue has been substantially addressed. Some tidying up and verification testing is required but this will be left until the next release.