James Mime4j
  1. James Mime4j
  2. MIME4J-203

RawField parsing problem in case of '\t' delimiter

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 0.7.1
    • Component/s: dom
    • Labels:
      None
    • Environment:
      linux 3.0.1, x86_64

      java version "1.6.0_26"
      Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
      Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)

      Description

      RawField doesn't drop '\t' char between field header and body, so body starts from LWSP-char that should be ignored by RFC822 (see 3.4.2 in spec) because date field is structured. So date isn't extracted.

      1. patch
        0.8 kB
        Konstantin Gribov

        Activity

        Hide
        Oleg Kalnichevski added a comment -

        I am not sure I understand. The RawField is not supposed to strip any characters. It represents raw content of a field, exactly the way it was read form the mime stream. Can you describe the exact problem you are having, preferably by providing a test case?

        Oleg

        Show
        Oleg Kalnichevski added a comment - I am not sure I understand. The RawField is not supposed to strip any characters. It represents raw content of a field, exactly the way it was read form the mime stream. Can you describe the exact problem you are having, preferably by providing a test case? Oleg
        Hide
        Konstantin Gribov added a comment -

        Structured fields use RawField.getBody() to extract field body content. This method omits only space symbol after field header and colon symbol, but in any structured field it should omit any linear-white-space-char sequence. So field like "Date:\tFri, 02 Sep 2011 00:00:00 +0300" should be equivalent to "Date: Fri, 02 Sep 2011 00:00:00 +0300", but now first variant wouldn't be parsed at all.

        Show
        Konstantin Gribov added a comment - Structured fields use RawField.getBody() to extract field body content. This method omits only space symbol after field header and colon symbol, but in any structured field it should omit any linear-white-space-char sequence. So field like "Date:\tFri, 02 Sep 2011 00:00:00 +0300" should be equivalent to "Date: Fri, 02 Sep 2011 00:00:00 +0300", but now first variant wouldn't be parsed at all.
        Hide
        Oleg Kalnichevski added a comment -

        I see the problem now.

        Oleg

        Show
        Oleg Kalnichevski added a comment - I see the problem now. Oleg
        Hide
        Konstantin Gribov added a comment -

        Patch to fix it.

        Show
        Konstantin Gribov added a comment - Patch to fix it.
        Hide
        Oleg Kalnichevski added a comment -

        Fixed in SVN trunk. Please review / re-test.

        Oleg

        Show
        Oleg Kalnichevski added a comment - Fixed in SVN trunk. Please review / re-test. Oleg
        Hide
        Stefano Bagnara added a comment -

        I don't agree with the anaysis.

        RFC2822 says that everything after the ":" is part of the body. It doesn't say anywhere that an initial FWS is not part of the body.

        Instead in "3.3. Date and Time Specification" we can see:

        date-time = [ day-of-week "," ] date FWS time [CFWS]

        day-of-week = ([FWS] day-name) / obs-day-of-week


        So you see a starting FWS is part of day-of-week, and so is part of the body.
        If the datetime field parser does not support this then it is a bug in the datetime parser, not in the RawField.

        (PS: Also obs-day-of-week allows CFWS at the beginning)

        Show
        Stefano Bagnara added a comment - I don't agree with the anaysis. RFC2822 says that everything after the ":" is part of the body. It doesn't say anywhere that an initial FWS is not part of the body. Instead in "3.3. Date and Time Specification" we can see: — date-time = [ day-of-week "," ] date FWS time [CFWS] day-of-week = ( [FWS] day-name) / obs-day-of-week So you see a starting FWS is part of day-of-week, and so is part of the body. If the datetime field parser does not support this then it is a bug in the datetime parser, not in the RawField. (PS: Also obs-day-of-week allows CFWS at the beginning)
        Hide
        Konstantin Gribov added a comment - - edited

        Just RFC822:

        3.4.2. WHITE SPACE

        Note: In structured field bodies, multiple linear space ASCII
        characters (namely HTABs and SPACEs) are treated as
        single spaces and may freely surround any symbol. In
        all header fields, the only place in which at least one
        LWSP-char is REQUIRED is at the beginning of continua-
        tion lines in a folded field.

        Show
        Konstantin Gribov added a comment - - edited Just RFC822: 3.4.2. WHITE SPACE Note: In structured field bodies, multiple linear space ASCII characters (namely HTABs and SPACEs) are treated as single spaces and may freely surround any symbol. In all header fields, the only place in which at least one LWSP-char is REQUIRED is at the beginning of continua- tion lines in a folded field.
        Hide
        Oleg Kalnichevski added a comment -

        Sorry, I overlooked your patch. Actually the best way to describe a problem is by submitting a patch with a fix. And ideally a good patch should also include a unit test.

        Oleg

        Show
        Oleg Kalnichevski added a comment - Sorry, I overlooked your patch. Actually the best way to describe a problem is by submitting a patch with a fix. And ideally a good patch should also include a unit test. Oleg
        Hide
        Oleg Kalnichevski added a comment -

        @Stefano

        The RawField#getBody used to skip the first blank after the body delimiter, but failed to do so if the character was a TAB. At the very least the behavior of the method would be consistent: either always strip the leading white space or strip no content at all.

        Oleg

        Show
        Oleg Kalnichevski added a comment - @Stefano The RawField#getBody used to skip the first blank after the body delimiter, but failed to do so if the character was a TAB. At the very least the behavior of the method would be consistent: either always strip the leading white space or strip no content at all. Oleg
        Hide
        Stefano Bagnara added a comment -

        @Jostya: I read it twice. No way it says the tab belongs to the field name and not to the body or that the tab is to be collapsed to the separator (the ":"). So it still belongs to the body. Then when you evaluate the body you follow that rule (also the rule says they have to be collapsed to a single space, not ignored, nor removed). Last thing: it is better to check latest RFC as RFC822 is obsolete. RFC5322 says:
        ------
        3.6.1. The Origination Date Field

        The origination date field consists of the field name "Date" followed
        by a date-time specification.

        orig-date = "Date:" date-time CRLF
        ------
        You see everything after "Date:" is a date-time token for the grammar. And the date-time token is defined as I previously quoted (may start with FWS).
        The obsolete rfc822 also allows spaces between "Date" and the ":" (, but BEFORE the ":").
        ------
        4.5.1. Obsolete Origination Date Field

        obs-orig-date = "Date" *WSP ":" date-time CRLF
        ------

        Everything after the ":" belong to the body and I still can't find any sentence in the rfc alluding to something else.

        @Oleg: Yes, I agree that they should be consistent (I simply said I didn't agree with the analysis of the rfc pointer provided). My main point is that the date header body parser should correctly accept initial FWS as the specification says it have to. I remember the main cause I left that space handling in RawField during last refactoring is because removing that we produced mimemessages that was uncommon because setting most header would result in "headename:headervalue" with no space, while the most common mime format found out there have a space and I believe some processing tool even doesn't happily accept an header without spaces after the ":" (it is almost a de fact standard): maybe we discussed this in the list, I don't remember right now, sorry.

        Show
        Stefano Bagnara added a comment - @Jostya: I read it twice. No way it says the tab belongs to the field name and not to the body or that the tab is to be collapsed to the separator (the ":"). So it still belongs to the body. Then when you evaluate the body you follow that rule (also the rule says they have to be collapsed to a single space, not ignored, nor removed). Last thing: it is better to check latest RFC as RFC822 is obsolete. RFC5322 says: ------ 3.6.1. The Origination Date Field The origination date field consists of the field name "Date" followed by a date-time specification. orig-date = "Date:" date-time CRLF ------ You see everything after "Date:" is a date-time token for the grammar. And the date-time token is defined as I previously quoted (may start with FWS). The obsolete rfc822 also allows spaces between "Date" and the ":" (, but BEFORE the ":"). ------ 4.5.1. Obsolete Origination Date Field obs-orig-date = "Date" *WSP ":" date-time CRLF ------ Everything after the ":" belong to the body and I still can't find any sentence in the rfc alluding to something else. @Oleg: Yes, I agree that they should be consistent (I simply said I didn't agree with the analysis of the rfc pointer provided). My main point is that the date header body parser should correctly accept initial FWS as the specification says it have to. I remember the main cause I left that space handling in RawField during last refactoring is because removing that we produced mimemessages that was uncommon because setting most header would result in "headename:headervalue" with no space, while the most common mime format found out there have a space and I believe some processing tool even doesn't happily accept an header without spaces after the ":" (it is almost a de fact standard): maybe we discussed this in the list, I don't remember right now, sorry.
        Hide
        Konstantin Gribov added a comment -

        @Stefano, first of all, take a look at my name. I see, you make one typo twice.

        The second thing, look at FWS decl in rfc5322: FWS = ([*WSP CRLF] 1*WSP) / obs-FWS. So it can contain at least one WSP (space ot horizontal tab). Than look at 3.3: date-time definition: day-of-week is prepended by optional FWS, day is surrounded by FWS (first is optional) etc. So field body should be collapsed, and then, for structured field normalized by collapsing FWS to space, as have been said in note in 2.2.

        Show
        Konstantin Gribov added a comment - @Stefano, first of all, take a look at my name. I see, you make one typo twice. The second thing, look at FWS decl in rfc5322: FWS = ( [*WSP CRLF] 1*WSP) / obs-FWS. So it can contain at least one WSP (space ot horizontal tab). Than look at 3.3: date-time definition: day-of-week is prepended by optional FWS, day is surrounded by FWS (first is optional) etc. So field body should be collapsed, and then, for structured field normalized by collapsing FWS to space, as have been said in note in 2.2.
        Hide
        Konstantin Gribov added a comment -

        Can I close issue? In current trunk this issue is fixed. The question is about this disscussion.

        Show
        Konstantin Gribov added a comment - Can I close issue? In current trunk this issue is fixed. The question is about this disscussion.
        Hide
        Stefano Bagnara added a comment -

        @Kostya (sorry for the typo): RFC clearly states that everything after the ":" is body. There is no way to read rfc differently.
        rfc5322 2.2:
        Header fields are lines beginning with a field name, followed by a colon (":"), followed by a field body, and terminated by CRLF.

        That said I agree that our RawBody already had some special handling for the space and so if we don't plan to really fix it we can at least make it consistent, but this doesn't change the fact that the RFC says something different about what is an header body.

        I don't have a better fix for this right now so I'm fine with the committed solution, but I wanted to aknowledge that we are not strictly doing what the RFC tells (of course the RFC doesn't tell us what classes to use and what our methods have to return, so we can simply document our current approach, maybe linking to this issue).

        In past I analyzed the behaviour of MUA wrt spaces at the start of the "Subject" header body. Every client ignores the first space, some clients trim every space at the start of the subject. So they already do something different from what RFC says and this is why I didn't find a good solution to be strict wrt RFC and still show a similar behaviour to what most clients do.

        Show
        Stefano Bagnara added a comment - @Kostya (sorry for the typo): RFC clearly states that everything after the ":" is body. There is no way to read rfc differently. rfc5322 2.2: Header fields are lines beginning with a field name, followed by a colon (":"), followed by a field body, and terminated by CRLF. That said I agree that our RawBody already had some special handling for the space and so if we don't plan to really fix it we can at least make it consistent, but this doesn't change the fact that the RFC says something different about what is an header body. I don't have a better fix for this right now so I'm fine with the committed solution, but I wanted to aknowledge that we are not strictly doing what the RFC tells (of course the RFC doesn't tell us what classes to use and what our methods have to return, so we can simply document our current approach, maybe linking to this issue). In past I analyzed the behaviour of MUA wrt spaces at the start of the "Subject" header body. Every client ignores the first space, some clients trim every space at the start of the subject. So they already do something different from what RFC says and this is why I didn't find a good solution to be strict wrt RFC and still show a similar behaviour to what most clients do.

          People

          • Assignee:
            Unassigned
            Reporter:
            Konstantin Gribov
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development