James Mime4j / MIME4J-34

o.a.j.m.message.Header#writeTo violates RFC 822

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3, 0.4
    • Fix Version/s: 0.6
    • Component/s: None
    • Labels: None

    Description

      The Header#writeTo method uses the content charset instead of US-ASCII, which RFC 822 requires. The same problem exists in Multipart#writeTo.

      Attachments

        1. mimeheader.patch
          16 kB
          Oleg Kalnichevski

          Activity

            Robert Burrell Donkin (robertburrelldonkin) added a comment:

            This issue has been substantially addressed. Some tidying up and verification testing is required, but this will be left until the next release.

            Markus Wiederkehr (wmax) added a comment:

            The code fragment Robert referred to in his comment from 10/Feb/08 seems to have gone with the "Second patch from MIM4J-5" (http://svn.apache.org/viewvc?view=rev&revision=674206).

            Are there still any other round trip problems or can this issue be closed?

            Stefano Bagnara (bago) added a comment:

            I think the title of this issue is too generic: can anyone update it to reflect the specific issue and the current mime4j status?

            The fact that writeTo produces sequences violating RFC 822 is also discussed in MIME4J-60 (even if MIME4J-60 should probably be split into parsing issues and DOM writeTo issues).

            Robert Burrell Donkin (robertburrelldonkin) added a comment:

            The round trip encoding still needs to be resolved, but this is likely to be a little delicate, so I propose leaving this until 0.5.

            Robert Burrell Donkin (robertburrelldonkin) added a comment:

            Do we need another mode to support naive conversion? (char -> byte)

            Robert Burrell Donkin (robertburrelldonkin) added a comment:

            Applied revised patch. Many thanks.

            Plan to leave this issue open to allow discussion of round tripping.

            Robert Burrell Donkin (robertburrelldonkin) added a comment:

            I'm going to apply this patch.

            Note that I suspect that accurate round-tripping is currently broken. I think that MimeTokenStream uses:

                int curr = 0;
                int prev = 0;
                while ((curr = cursor.advance()) != -1) {
                    if (curr == '\n' && (prev == '\n' || prev == 0)) {
                        /*
                         * [\r]\n[\r]\n or an immediate \r\n have been seen.
                         */
                        sb.deleteCharAt(sb.length() - 1);
                        break;
                    }
                    sb.append((char) curr);
                    prev = curr == '\r' ? prev : curr;
                }

            which does a naive (and inefficient) cast conversion from byte to char. So, to round trip, I think that the output would need to perform the reverse operation, scanning each character in the string and casting it to a byte before pushing it into the buffer. This option is not available at the moment.

            Robert
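Editor's note: the cast-based round trip described in the comment above can be sketched as follows. This is a minimal illustration, not mime4j code; the class and method names are hypothetical. The byte-to-char direction mirrors the `(char) curr` cast in the quoted loop, and the char-to-byte direction is the reverse operation Robert describes. The mapping happens to coincide with ISO-8859-1.

```java
final class NaiveRoundTrip {

    /** Widens each byte (0-255) to one char, like the (char) cast in the parser loop. */
    static String bytesToChars(byte[] raw) {
        StringBuilder sb = new StringBuilder(raw.length);
        for (byte b : raw) {
            sb.append((char) (b & 0xFF)); // mask so bytes >= 0x80 do not sign-extend
        }
        return sb.toString();
    }

    /** Reverse operation: narrows each char back to one byte. */
    static byte[] charsToBytes(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0xFF) {
                // Such a char cannot have come from bytesToChars, so refuse to guess.
                throw new IllegalArgumentException("char above 0xFF at index " + i);
            }
            out[i] = (byte) c;
        }
        return out;
    }
}
```

Because every byte maps to exactly one char and back, `charsToBytes(bytesToChars(raw))` reproduces `raw` for any input, including 8-bit header data. The round trip breaks only if the string is re-encoded with a different charset on output.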

            Robert Burrell Donkin (robertburrelldonkin) added a comment:

            Oleg - RFC2047 defines a custom codec for message headers. Geronimo should have a basic implementation.

            Oleg Kalnichevski (olegk) added a comment:

            > The strictly correct approach would be to create a charset implementation that understood how to encode and decode this syntax.

            Robert,

            I am not sure that would work, as the encoding is meant to apply to message elements such as header bodies, and not the whole data stream.

            > Maybe geronimo has its own implementation of MimeUtility?

            Stefano,

            Just for the record, Commons Codec provides both Base64 and quoted-printable codecs. Unfortunately Codec cannot work with streams, but this does not seem to be very relevant in this particular case.

            Oleg

            Oleg Kalnichevski (olegk) added a comment:

            The new version of the patch adds support for different protocol compliance levels:

            • STRICT_ERROR: ASCII charset is used when writing out multipart boundary elements; the use of invalid characters causes a MimeException
            • STRICT_IGNORE: ASCII charset is used when writing out multipart boundary elements; invalid characters are silently ignored (replaced with ?)
            • LENIENT: content charset is used when writing out multipart boundary elements

            Lenient mode is used by default.

            Please review and let me know what you think.

            Oleg
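Editor's note: the three compliance levels listed above map naturally onto `java.nio.charset.CharsetEncoder` error actions. The sketch below is an assumption about how such modes could be implemented, not the contents of mimeheader.patch. The class name, the `encode` helper, and the use of `IllegalArgumentException` as a stand-in for MimeException are all hypothetical, as is the choice of REPLACE for LENIENT (the comment does not say how LENIENT treats unmappable characters).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ComplianceSketch {

    enum Mode { STRICT_ERROR, STRICT_IGNORE, LENIENT }

    /** Encodes header or boundary text according to the chosen mode. */
    static byte[] encode(String text, Charset contentCharset, Mode mode) {
        // Strict modes force US-ASCII; lenient mode keeps the content charset.
        Charset charset = (mode == Mode.LENIENT) ? contentCharset : StandardCharsets.US_ASCII;
        // STRICT_ERROR reports bad characters; the other modes substitute '?'.
        CodingErrorAction action = (mode == Mode.STRICT_ERROR)
                ? CodingErrorAction.REPORT
                : CodingErrorAction.REPLACE;
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
            byte[] out = new byte[buffer.remaining()];
            buffer.get(out);
            return out;
        } catch (CharacterCodingException e) {
            // Stand-in for the MimeException mentioned in the comment above.
            throw new IllegalArgumentException("text is not encodable as " + charset, e);
        }
    }
}
```

With a UTF-8 content charset, the string "café" would fail under STRICT_ERROR, become "caf?" under STRICT_IGNORE, and be written as UTF-8 bytes under LENIENT.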

            Stefano Bagnara (bago) added a comment:

            OK, I know you probably don't want to add JavaMail dependencies, but javax.mail.internet.MimeUtility contains methods to encode/decode text, including for the RFC 2047 format.

            Maybe Geronimo has its own implementation of MimeUtility? I spent 10 minutes in the Geronimo repository and on the Geronimo website, but I was unable to find the svn location for their latest implementation.

            Robert Burrell Donkin (robertburrelldonkin) added a comment:

            Thanks Stefano

            RFC2047 covers international encoding in detail. The strictly correct approach would be to create a charset implementation that understood how to encode and decode this syntax. Otherwise round tripping will fail. This would be a reasonable amount of work. It's also a shame that it's not widely supported.

            For email, it's important that the original byte sequence is retained to ensure that round tripping works correctly. This probably means some additional work in the parser.

            But having talked around this point, IMHO a patch that allows flexibility is important. IMHO it should be OK to commit a patch as outlined by Oleg.

            Robert

            Oleg Kalnichevski (olegk) added a comment:

            Robert,

            In the HttpComponents / HttpClient project we tend to be lenient about standards when parsing incoming messages but strict when generating outgoing. Probably this approach may not apply to email, since often messages are relayed through a chain of MTAs with various levels of protocol compliance. You know it better. Anyways, I'll try to put together another patch in the coming days.

            Oleg

            Stefano Bagnara (bago) added a comment:

            The base64 and quoted-printable encodings IMHO refer to RFC2047 (http://www.faqs.org/rfcs/rfc2047.html).
            The format for the encoded header values is:
            =?#charset#?#encoding#?#encoded-text#?=
            I don't know how well clients support this for other headers, but it works for Subjects.

            I know of some mailers (MUAs) that recognize this format only when it is used for the WHOLE header value and not for a single encoded word.

            Some mailers (MUAs) also accept 8bit data in the headers and treat it using the same charset declared by the headers. This is not RFC-compliant behaviour, but I think this happened very much in oriental countries in the past years, so many MUAs support this to increase compatibility with "badly formatted" emails.

            IMHO a "leave as is" option should be made available. Sometimes we want to alter a message by adding some header without altering any other bit. (Maybe this is your "international-lenient content+IGNORE"?)
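Editor's note: the `=?charset?encoding?encoded-text?=` format described above can be illustrated with a small sketch that builds a single RFC 2047 encoded-word using the "B" (Base64) encoding. The class and method names are hypothetical. The sketch assumes the whole text fits in one encoded-word, so it does not enforce the 75-character limit RFC 2047 places on each encoded-word, and it does not skip encoding for text that is already pure ASCII.

```java
import java.nio.charset.Charset;
import java.util.Base64;

final class EncodedWordSketch {

    /** Builds one RFC 2047 encoded-word using the "B" (Base64) encoding. */
    static String encodeB(String text, Charset charset) {
        // Encode the characters with the declared charset, then Base64 the bytes.
        String base64 = Base64.getEncoder().encodeToString(text.getBytes(charset));
        return "=?" + charset.name() + "?B?" + base64 + "?=";
    }
}
```

For example, `encodeB("café", StandardCharsets.UTF_8)` yields `=?UTF-8?B?Y2Fmw6k=?=`, which contains only ASCII characters and can therefore appear in a header field body.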
            Robert Burrell Donkin (robertburrelldonkin) added a comment (edited):

            "If you like I can implement two modes for Header#writeTo and Multipart#writeTo methods: strict (ASCII) and lenient (content charset). We'll need this feature in HttpClient anyways."

            This sounds good to me. Probably 4 modes would be good (but I can patch any you miss):

            standard-strict should be ASCII+ERROR (when a string contains non-ASCII characters, the operation fails)

            standard-lenient would be ASCII+REPLACE (or IGNORE?)

            international-strict would be content+ERROR

            international-lenient would be content+REPLACE (or IGNORE?)

            AIUI the current default is international-lenient

            Robert

            Robert Burrell Donkin (robertburrelldonkin) added a comment:

            "as far as the strict interpretation of the MIME standard is concerned only ASCII characters seem permissible"

            yes

            "and non-ASCII are expected to be escaped using BASE64 or Quoted-Printable encodings."

            no - AIUI non-ASCII header values are not expected at all by the specification. I do not know of any way for a client to discover such a transfer encoding.

            This patch has considerable potential interoperability impact on email (where international mail clients often interpret the headers using the content-encoding).

            Robert

            Oleg Kalnichevski (olegk) added a comment:

            The new version of the patch also makes sure the Multipart class uses the ASCII charset when writing preamble, boundary, and epilogue elements to the output stream. Please note this does not affect the encoding of the body parts.

            Oleg
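Editor's note: the separation described above, ASCII for the multipart framing and untouched bytes for the body parts, can be sketched as below. This is an illustration under stated assumptions, not the Multipart#writeTo implementation: the class and method names are hypothetical, each part is assumed to arrive as already-encoded bytes (its own headers included), and preamble and epilogue are omitted for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class MultipartWriteSketch {

    private static final byte[] CRLF = {'\r', '\n'};

    /** Frames pre-encoded parts with boundary delimiters written in US-ASCII. */
    static byte[] write(String boundary, List<byte[]> parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Only the framing is converted here; getBytes substitutes '?' for non-ASCII chars.
        byte[] delimiter = ("--" + boundary).getBytes(StandardCharsets.US_ASCII);
        for (byte[] part : parts) {
            out.write(delimiter, 0, delimiter.length);
            out.write(CRLF, 0, CRLF.length);
            out.write(part, 0, part.length); // body bytes pass through unchanged
            out.write(CRLF, 0, CRLF.length);
        }
        // Close delimiter: "--boundary--"
        out.write(delimiter, 0, delimiter.length);
        out.write('-');
        out.write('-');
        out.write(CRLF, 0, CRLF.length);
        return out.toByteArray();
    }
}
```

Since the parts are never decoded to strings, 8-bit body content survives regardless of which charset is applied to the boundary lines.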

            Oleg Kalnichevski (olegk) added a comment:

            Robert, et al

            Many browsers indeed use content charset to encode multipart header values, so we will have to provide a browser compatible mode in HttpClient. However, as far as the strict interpretation of the MIME standard is concerned only ASCII characters seem permissible and non-ASCII are expected to be escaped using BASE64 or Quoted-Printable encodings.

            ====

            field       =  field-name ":" [ field-body ] CRLF

            field-name  =  1*<any CHAR, excluding CTLs, SPACE, and ":">

            field-body  =  field-body-contents
                           [CRLF LWSP-char field-body]

            field-body-contents =
                          <the ASCII characters making up the field-body, as
                           defined in the following sections, and consisting
                           of combinations of atom, quoted-string, and
                           specials tokens, or else consisting of texts>
            ====

            If you like I can implement two modes for Header#writeTo and Multipart#writeTo methods: strict (ASCII) and lenient (content charset). We'll need this feature in HttpClient anyways.

            Oleg
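Editor's note: the RFC 822 grammar quoted above translates into simple character checks. The sketch below is a hypothetical helper, not part of mime4j. It assumes the RFC 822 definitions of CHAR (ASCII 0 to 127), CTL (0 to 31 and 127), and SPACE (32), so a field-name may contain only characters 33 to 126 other than ":". It checks only the character repertoire of a field body, not its full atom/quoted-string structure.

```java
final class Rfc822FieldSketch {

    /** field-name = 1*<any CHAR, excluding CTLs, SPACE, and ":"> */
    static boolean isValidFieldName(String name) {
        if (name.isEmpty()) {
            return false; // "1*" requires at least one character
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            // Reject CTLs and SPACE (<= 32), DEL and non-ASCII (>= 127), and the colon.
            if (c <= 32 || c >= 127 || c == ':') {
                return false;
            }
        }
        return true;
    }

    /** True when every character of the field body is in the ASCII range. */
    static boolean isAsciiOnly(String body) {
        for (int i = 0; i < body.length(); i++) {
            if (body.charAt(i) > 127) {
                return false;
            }
        }
        return true;
    }
}
```

A strict writer could run `isAsciiOnly` on each field body before output and either fail or fall back to an escaped form when it returns false.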

            Robert Burrell Donkin (robertburrelldonkin) added a comment:

            WRT Multipart, it seems wrong to ignore the transfer encoding.

            Again, hopefully people will jump in with more opinions.

            Robert Burrell Donkin (robertburrelldonkin) added a comment:

            I wonder whether this might have some wrinkles.

            There's RFC822 to consider (as well as 2045).

            The header name MUST be US-ASCII. The field body may be composed of US-ASCII characters barring CR and LF.

            IIRC practically, an email client is at liberty to interpret these characters as it pleases. Many choose to interpret them according to the encoding.

            AIUI when the field body content contains non-US-ASCII characters, if US-ASCII is forced then they will be written out in a way that's unintelligible for any client. If content charset is used, then some email clients may be able to interpret them correctly. However, the body will need to be checked for bitwise line breaks. (The output method uses a string, which is inefficient.)

            But I'm not an expert - hopefully someone who is will jump in...

            At the least, it seems unreasonable not to give the user the option to use US-ASCII or content-type encoding.

            If you have a particular example in mind where content-encoding produces a bad result, it would be useful if you could explain it.

            Oleg Kalnichevski (olegk) added a comment:

            The patch (attached) should fix the problem. I'll submit a patch for the Multipart as soon as MIME4J-32 is resolved. Please review.

            Oleg

            People

              Assignee: Unassigned
              Reporter: Oleg Kalnichevski (olegk)
              Votes: 0
              Watchers: 0
