Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Versions: 0.3, 0.4
- Component/s: None
- Labels: None
Description
The Header#writeTo method uses the content charset instead of the US-ASCII charset required by RFC 822. The same problem exists in Multipart#writeTo.
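In other words, the fix amounts to serializing header fields with US-ASCII rather than the body's charset. A minimal illustration of that idea (hypothetical code, not the actual mime4j implementation):

import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch: header fields are serialized using US-ASCII,
// regardless of the charset used for the message body.
public class AsciiHeaderWriter {
    static void writeField(String fieldLine, OutputStream out) throws IOException {
        out.write(fieldLine.getBytes("US-ASCII")); // unmappable characters become '?'
        out.write('\r');
        out.write('\n'); // RFC 822 lines are terminated by CRLF
    }
}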
Attachments
- mimeheader.patch (16 kB), attached by Oleg Kalnichevski
Issue Links
- relates to: MIME4J-112 Define Limits Of Round Tripping In Mime4J (Open)
Activity
The code fragment Robert referred to in his comment from 10/Feb/08 seems to have gone with the "Second patch from MIM4J-5" (http://svn.apache.org/viewvc?view=rev&revision=674206).
Are there still any other round trip problems or can this issue be closed?
I think the title of this issue is too generic: can anyone update it to reflect the specific problem and the current mime4j status?
The fact that writeTo produces sequences violating RFC 822 is also discussed in MIME4J-60 (even if MIME4J-60 should probably be split into parsing issues and DOM writeTo issues).
The round-trip encoding still needs to be resolved, but this is likely to be a little delicate, so I propose leaving it until 0.5.
Do we need another mode to support naive conversion? (char -> byte)
Applied revised patch. Many thanks.
Plan to leave this issue open to allow discussion of round tripping.
I'm going to apply this patch.
Note that I suspect that accurate round-tripping is currently broken. I think that MimeTokenStream uses:
int curr = 0;
int prev = 0;
while ((curr = cursor.advance()) != -1) {
    if (curr == '\n' && (prev == '\n' || prev == 0))
        sb.append((char) curr);
    prev = curr == '\r' ? prev : curr;
}
which does a naive (and inefficient) cast conversion. So, to round trip I think that the output would need to perform the reverse operation, scanning each character in the string and casting it to a byte before pushing it into the buffer. This option is not available ATM.
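A minimal sketch of that reverse operation (hypothetical helper, not part of the current API), assuming each char in the decoded string carries one original byte value:

public class RoundTripSketch {
    // Undo the naive (char) cast by keeping only the low 8 bits of every character.
    static byte[] toBytes(String decoded) {
        byte[] raw = new byte[decoded.length()];
        for (int i = 0; i < decoded.length(); i++) {
            raw[i] = (byte) decoded.charAt(i);
        }
        return raw;
    }
}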
Robert
Oleg - RFC2047 defines a custom codec for message headers. Geronimo should have a basic implementation.
> The strictly correct approach would be to create a charset implementation that understood how to encode and decode this syntax.
Robert,
I am not sure that would work, as the encoding is meant to apply to message elements such as header bodies, and not the whole data stream.
> Maybe Geronimo has its own implementation of MailUtility?
Stefano,
Just for the record, Commons Codec provides both Base64 and quoted-printable codecs. Unfortunately, Codec cannot work with streams, but this does not seem to be very relevant in this particular case.
Oleg
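As a rough illustration of the point about Commons Codec (assuming commons-codec on the classpath; these are the content-transfer-encoding codecs, so the RFC 2047 encoded-word wrapper =?charset?encoding?...?= would still have to be added around the result):

import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.net.QuotedPrintableCodec;

public class CodecSketch {
    public static void main(String[] args) throws Exception {
        // Base64-encode the header body bytes in their original charset
        byte[] raw = "résumé".getBytes("ISO-8859-1");
        System.out.println(new String(Base64.encodeBase64(raw), "US-ASCII"));
        // Quoted-printable encoding of the same text
        QuotedPrintableCodec qp = new QuotedPrintableCodec("ISO-8859-1");
        System.out.println(qp.encode("résumé"));
    }
}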
The new version of the patch adds support for different protocol compliance levels:
- STRICT_ERROR: the ASCII charset is used when writing out multipart boundary elements; the use of invalid characters causes a MimeException
- STRICT_IGNORE: the ASCII charset is used when writing out multipart boundary elements; invalid characters are silently ignored (replaced with '?')
- LENIENT: the content charset is used when writing out multipart boundary elements
Lenient mode is used by default.
Please review and let me know what you think.
Oleg
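A rough sketch of what the three compliance levels described above could look like in code (names and signature are hypothetical and need not match the actual patch):

public class ComplianceSketch {

    enum ComplianceMode { STRICT_ERROR, STRICT_IGNORE, LENIENT }

    static byte[] encodeBoundaryElement(String text, String contentCharset, ComplianceMode mode)
            throws java.io.UnsupportedEncodingException {
        if (mode == ComplianceMode.STRICT_ERROR) {
            for (int i = 0; i < text.length(); i++) {
                if (text.charAt(i) > 127) {
                    // the actual patch reports this as a MimeException
                    throw new IllegalArgumentException("non-ASCII character in boundary element");
                }
            }
            return text.getBytes("US-ASCII");
        } else if (mode == ComplianceMode.STRICT_IGNORE) {
            // String.getBytes silently replaces unmappable characters with '?'
            return text.getBytes("US-ASCII");
        } else {
            // LENIENT: fall back to the charset of the content
            return text.getBytes(contentCharset);
        }
    }
}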
Ok, I know you probably don't want to add JavaMail dependencies, but javax.mail.internet.MailUtility contains methods to encode/decode text in the RFC 2047 format as well.
Maybe Geronimo has its own implementation of MailUtility? I spent 10 minutes in the Geronimo repository and on the Geronimo website, but I was unable to find the SVN location of their latest implementation.
Thanks Stefano
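For reference, the standard JavaMail helper for RFC 2047 text is javax.mail.internet.MimeUtility; a minimal sketch (shown only for comparison, since mime4j avoids the JavaMail dependency):

import javax.mail.internet.MimeUtility;

public class EncodedWordSketch {
    public static void main(String[] args) throws Exception {
        // Encode a non-ASCII header body as an RFC 2047 encoded word ("B" = Base64)
        String encoded = MimeUtility.encodeText("café", "UTF-8", "B");
        System.out.println(encoded);                          // e.g. =?UTF-8?B?Y2Fmw6k=?=
        System.out.println(MimeUtility.decodeText(encoded));  // café
    }
}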
RFC2047 covers international encoding in detail. The strictly correct approach would be to create a charset implementation that understood how to encode and decode this syntax. Otherwise round tripping will fail. This would be a reasonable amount of work. It's also a shame that it's not widely supported.
For email, it's important that the original byte sequence is retained to ensure that round tripping works correctly. This probably means some additional work in the parser.
But having talked around this point, IMHO a patch that allows flexibility is important. IMHO it should be OK to commit a patch as outlined by Oleg.
Robert
Robert,
In the HttpComponents / HttpClient project we tend to be lenient about standards when parsing incoming messages but strict when generating outgoing ones. This approach may well not apply to email, though, since messages are often relayed through a chain of MTAs with varying levels of protocol compliance. You know this better than I do. Anyway, I'll try to put together another patch in the coming days.
Oleg
The base64 and quoted-printable encodings IMHO refer to RFC 2047 (http://www.faqs.org/rfcs/rfc2047.html).
The format for the encoded header values is:
=?#charset#?#encoding#?#encoded-text#?=
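For example (illustrative only), a Subject of "Grüße" encoded with the ISO-8859-1 charset and Q encoding becomes:
Subject: =?ISO-8859-1?Q?Gr=FC=DFe?=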
I don't know what client support is like for other headers, but it works for Subjects.
I know of some mailers (MUAs) that recognize this format only when it is used for the WHOLE header value and not for a single encoded word.
Some mailers (MUAs) also accept 8-bit data in the headers and treat it using the same charset declared by the headers. This is not RFC-compliant behaviour, but I think this was very common in East Asian countries in past years, so many MUAs support it to increase compatibility with "badly formatted" emails.
IMHO a "leave as is" option should be made available. Sometimes we want to alter a message by adding some header without altering any other bit (maybe this is your "international-lenient content+IGNORE"?).
"If you like I can implement two modes for Header#writeTo and Multipart#writeTo methods: strict (ASCII) and lenient (content charset). We'll need this feature in HttpClient anyways."
This sounds good to me. Probably 4 modes would be good (but I can patch any you miss):
- standard-strict should be ASCII+ERROR (when a string contains non-ASCII characters, the operation fails)
- standard-lenient would be ASCII+REPLACE (or IGNORE?)
- international-strict would be content+ERROR
- international-lenient would be content+REPLACE (or IGNORE?)
AIUI the current default is international-lenient.
Robert
"as far as the strict interpretation of the MIME standard is concerned only ASCII characters seem permissible"
yes
"and non-ASCII are expected to be escaped using BASE64 or Quoted-Printable encodings."
no - AIUI non-ASCII header values are not expected at all by the specification. I do not know of any way for a client to discover such a transfer encoding.
This patch has considerable potential interoperability impact on email (where international mail clients often interpret the headers using the content encoding).
- robert
The new version of the patch also makes sure the Multipart class uses the ASCII charset when writing preamble, boundary, and epilogue elements to the output stream. Please note this does not affect the encoding of the body parts.
Oleg
Robert, et al
Many browsers indeed use the content charset to encode multipart header values, so we will have to provide a browser-compatible mode in HttpClient. However, as far as the strict interpretation of the MIME standard is concerned, only ASCII characters seem permissible and non-ASCII characters are expected to be escaped using Base64 or Quoted-Printable encodings.
====
field               = field-name ":" [ field-body ] CRLF
field-name          = 1*<any CHAR, excluding CTLs, SPACE, and ":">
field-body          = field-body-contents
                      [CRLF LWSP-char field-body]
field-body-contents =
                     <the ASCII characters making up the field-body, as
                      defined in the following sections, and consisting
                      of combinations of atom, quoted-string, and
                      specials tokens, or else consisting of texts>
====
If you like I can implement two modes for Header#writeTo and Multipart#writeTo methods: strict (ASCII) and lenient (content charset). We'll need this feature in HttpClient anyways.
Oleg
WRT Multipart, it seems wrong to ignore the transfer encoding
Again, hopefully people will jump in with more opinions
I wonder whether this might have some wrinkles
There's RFC822 to consider (as well as 2045)
The header name MUST be US-ASCII. The field body may be composed of US-ASCII characters barring CR and LF.
IIRC practically, an email client is at liberty to interpret these characters as they please. Many choose to interpret them according to the encoding.
AIUI when the field body content contains non-US-ASCII characters, if US-ASCII is forced then they will be written out in a way that is unintelligible to any client. If the content charset is used, then some email clients may be able to interpret them correctly. However, the body will need to be checked for bitwise line breaks. (The output method uses a string, which is inefficient.)
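A small illustration of this trade-off (plain java.lang.String encoding; '?' is the default replacement for characters US-ASCII cannot represent):

public class CharsetTradeoffSketch {
    public static void main(String[] args) throws java.io.UnsupportedEncodingException {
        String body = "Grüße";
        // The content charset keeps the original byte values (47 72 FC DF 65) ...
        byte[] latin1 = body.getBytes("ISO-8859-1");
        // ... while forcing US-ASCII degrades the non-ASCII characters to '?'
        byte[] ascii = body.getBytes("US-ASCII");
        System.out.println(new String(ascii, "US-ASCII")); // prints "Gr??e"
    }
}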
But I'm not an expert - hopefully someone who is will jump in...
At the least, it seems unreasonable not to give the user the option to use US-ASCII or content-type encoding
If you have a particular example in mind where content-encoding produces a bad result, it would be useful if you could explain it.
The patch (attached) should fix the problem. I'll submit a patch for the Multipart as soon as MIME4J-32 is resolved. Please review.
Oleg
This issue has been substantially addressed. Some tidying up and verification testing is required but this will be left until the next release.