SA Bugzilla – Bug 6703
sa-learn doesn't work with kmail 2 mbox format
Last modified: 2012-08-15 04:04:48 UTC
On 3.3.2 version command sa-learn from maibox I have next on any input mailboxes: > sa-learn --spam --mbox spam-2011-10-16.mbox > Learned tokens from 0 message(s) (0 message(s) examined) After downgrade to v 3.2.5 on same files I have next output: > sa-learn --spam --mbox spam-2011-10-16.mbox > Learned tokens from 0 message(s) (3303 message(s) examined) learning from splitted messages does work correctly.
When saving messages with kmail 1 the From Line has following format which is not iaw RFC 822: From info@ende-18-06.com Fri Jun 17 16:03:07 2011 which is not iaw RFC 822 but will be parsed by sa-learn reporting: Learned tokens from 1 message(s) (1 message(s) examined With kmail 2 the format is changed to the format which is iaw RFC 822 From thomas@arend-rhb.de Tue, 15 May 2012 22:01:41 +0200 which is not parsed correctly by sa-learn. sa-learn --spam reports: Learned tokens from 0 message(s) (0 message(s) examined) Proposed to change the behavior in a way that the old malformed From lines and the new correct ones are parsed.
Created attachment 5066 [details] Message saved from kmail Version 4.8.3 Message with new From line format
Created attachment 5067 [details] Message saved from kmail Version 4.8.3 Message with new From line format iaw RFC 822
Comment on attachment 5067 [details] Message saved from kmail Version 4.8.3 Sorry, this was saved with KMail: 1.13.6
> When saving messages with kmail 1 the From Line has following format > which is not iaw RFC 822: > From info@ende-18-06.com Fri Jun 17 16:03:07 2011 > With kmail 2 the format is changed to the format which is iaw RFC 822 > From thomas@arend-rhb.de Tue, 15 May 2012 22:01:41 +0200 > which is not parsed correctly by sa-learn. sa-learn --spam reports: [...] Oh, no, not yet another incompatible mbox format!!! > Proposed to change the behavior in a way that the old malformed From lines > and the new correct ones are parsed. It is the other way around, the new one differs from everybody else. The format of the mbox file (along with its separator From_ lines) is *not* governed by RFC 822 or its successors. There is no formal standard for an mbox format, the RFC 4155 comes closest: http://tools.ietf.org/html/rfc4155 See also a Wikipedia article: http://en.wikipedia.org/wiki/Mbox RFC 4155 says: | a timestamp indicating the UTC date and time when the message | was originally received, conformant with the syntax of the | traditional UNIX 'ctime' output sans timezone (note that the | use of UTC precludes the need for a timezone indicator); This matches qmail docs: http://qmail.org/qmail-manual-html/man5/mbox.html and matches Postfix and sendmail's local delivery agent. To accommodate the new incompatible format it seems that the two instances of a regexps in ArchiveIterator.pm need to be extended, or just relaxed. Not sure if the date would still be correctly parsed. Best would be to persuade kmail folks to back off the change!
> To accommodate the new incompatible format it seems that the > two instances of a regexps in ArchiveIterator.pm need to be > extended, or just relaxed. Not sure if the date would still > be correctly parsed. > > Best would be to persuade kmail folks to back off the change! Agreeing with Mark. There is no standard mbox format. However, I can't find the ticket but I made some small tweaks recently-ish to trunk to help another mbox format issue. I wonder if trunk can parse this new format? 1 - Can you attach a file with 3 mbox format files that aren't read 2 - Perhaps we need a relaxed option in ArchiveIterator that uses a less stringent From line test or even a way to specify a regex to use in a cf? Regards, KAM
> 1 - Can you attach a file with 3 mbox format files that aren't read Two are already attached, one old and one new. I tried it, and the trunk can't parse the new one. > 2 - Perhaps we need a relaxed option in ArchiveIterator that uses a less > stringent From line test or even a way to specify a regex to use in a cf? I thought of that. The From line test is really quite stringent. The concern is that one could encounter an mbox file in an mboxcl or mboxcl2 format (see the Wikipedia article on mbox), where a From may not even be protected by a '>', but parsing relies on a Content-Length header field. I believe this is also being adopted by Dovecot.
(In reply to comment #8) > > 1 - Can you attach a file with 3 mbox format files that aren't read > > Two are already attached, one old and one new. > I tried it, and the trunk can't parse the new one. I meant a single mbox file with the new format with 3 emails in it. > > 2 - Perhaps we need a relaxed option in ArchiveIterator that uses a less > > stringent From line test or even a way to specify a regex to use in a cf? > > I thought of that. The From line test is really quite stringent. Agreed. > The concern is that one could encounter an mbox file in an mboxcl or mboxcl2 > format (see the Wikipedia article on mbox), where a From may not even > be protected by a '>', but parsing relies on a Content-Length header > field. I believe this is also being adopted by Dovecot. Well at least mboxcl is a different format and we'll have to cross that bridge if it looks like it will be adopted.
Created attachment 5068 [details] Example with 3 true positive spam messages This is am example for sa-learn with 3 true positive spam mails.
Created attachment 5069 [details] Example with 3 true negative (ham) messages This is am mbox example file with three true negative (ham) messages. "sa-learn --ham --mbox" reports Learned tokens from 0 message(s) (0 message(s) examined)
So, what you think about that version 3.2.5 is working fine with same mailboxes on same perl version?
A first quick and dirty solution to work around this problem was encouraging. I converted the kmail2 mbox with the following command to the older version: sed '/^From / s#\(...\), \([0-3][0-9]\) \(...\) 2012 \([0-2][0-9]:[0-5][0-9]:[0-5][0-9]\) [+-][0-9]\{4\}#\1 \3 \2 \4 2012#' <spam-old.mbox >spam-new.mbox sa-learn reported: Learned tokens from 1489 message(s) (1733 message(s) examined) Which may be ok some messages may be learned earlier. A trailing "0" in the day - which may not comply to ctime format - does not effect the parsing of the mbox file. This work-araound will help me over the next days and weeks. But learning in kmail2 may not work or will need a modification. I understand that there are different definitions for the date time in the "From_" line. What a mess! I agree that the best would be to persuade the kmail folks to back off the change. You may refer to the following bug report: 297198 https://bugs.kde.org/show_bug.cgi?id=297198
(In reply to comment #13) > A first quick and dirty solution to work around this problem was encouraging. > > I converted the kmail2 mbox with the following command to the older version: > > sed '/^From / s#\(...\), \([0-3][0-9]\) \(...\) 2012 > \([0-2][0-9]:[0-5][0-9]:[0-5][0-9]\) [+-][0-9]\{4\}#\1 \3 \2 \4 2012#' > <spam-old.mbox >spam-new.mbox > > sa-learn reported: > Learned tokens from 1489 message(s) (1733 message(s) examined) > > Which may be ok some messages may be learned earlier. > > A trailing "0" in the day - which may not comply to ctime format - does not > effect the parsing of the mbox file. > > This work-araound will help me over the next days and weeks. But learning in > kmail2 may not work or will need a modification. > > I understand that there are different definitions for the date time in the > "From_" line. What a mess! > > I agree that the best would be to persuade the kmail folks to back off the > change. You may refer to the following bug report: 297198 > > https://bugs.kde.org/show_bug.cgi?id=297198 Very interesting idea to use sed. You should be able to pipe that directly to sa-learn as well instead of even using a temporary file. That's a good temporary fix.
>Very interesting idea to use sed. You should be able to pipe that directly to >sa-learn as well instead of even using a temporary file. That's a good >temporary fix. Thanks, I did not time zone adjustment. Would be a bit complicate for sed. I hope this will not effect the spam tokens to much.
> I agree that the best would be to persuade the kmail folks to back off the > change. You may refer to the following bug report: 297198 > https://bugs.kde.org/show_bug.cgi?id=297198 Great, thanks. Could you please add contents of Comment 6 (possibly edited if you wish) to that kde ticket, and/or a ref to this PR: http://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703
(In reply to comment #16) > > I agree that the best would be to persuade the kmail folks to back off the > > change. You may refer to the following bug report: 297198 > > https://bugs.kde.org/show_bug.cgi?id=297198 > > Great, thanks. Could you please add contents of Comment 6 > (possibly edited if you wish) to that kde ticket, > and/or a ref to this PR: > http://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703 Have added your comment to the KDE bug report and added a cross reference to this report.
Thanks to the hint to ArchiveIterator.pm I found two regex for checking the From_ line in Mail::SpamAssassin::ArchiveIterator and replaced them with following regex: (I am running Version 3.3.1) /From \S+ ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/ ) This replacement works fine for me and it is compatible with both mbox date formats. My pattern checks the date stricter than the old pattern which may be an advantage or an disadvantage. diff (orig) (new) 399c399 < last if (substr($_,0,5) eq "From " && @msg && /^From \S+ ?\S\S\S \S\S\S .\d .\d:\d\d:\d\d \d{4}/); --- > last if (substr($_,0,5) eq "From " && @msg && /^From \S+ ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/ ); 912c912,913 < /^From \S+ ?\S\S\S \S\S\S .\d .\d:\d\d:\d\d \d{4}/) { --- > /From \S+ ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/ ) {
Upps there was a small copy error. There must be a ^ before "From " /^From \S+ ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/ )
(In reply to comment #19) > Upps there was a small copy error. There must be a ^ before "From " > > /^From \S+ ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d > (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d > [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [ > 0-2]\d:\d\d:\d\d \d{4})/ ) This regex scares me because of the localization issue. For example, Lun for Monday in Spanish.
(In reply to comment #7) > However, I can't find the ticket but I made some small tweaks recently-ish > to trunk to help another mbox format issue. I wonder if trunk can parse > this new format? Bug 6413 and no, that was for communigate not kmail. Not relevant except it touches on the same non-standardization for mbox formats.
(In reply to comment #20) > (In reply to comment #19) > > Upps there was a small copy error. There must be a ^ before "From " > > > > /^From \S+ ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d > > (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d > > [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [ > > 0-2]\d:\d\d:\d\d \d{4})/ ) > > This regex scares me because of the localization issue. For example, Lun > for Monday in Spanish. Agreed. How about: /^From \S+ ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2} \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/ I'm assuming [:upper:] and [:lower:] will match accented characters properly. I haven't tested that assumption.
Created attachment 5070 [details] Patch to add options for defining the ArchiveIterator From regex as a Conf file option > How about: > > /^From \S+ ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2} > \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [ > 0-2]\d:\d\d:\d\d \d{4})/ > > I'm assuming [:upper:] and [:lower:] will match accented characters > properly. I haven't tested that assumption. I don't know enough about foreign languages to know for sure the format is always leading caps, etc. So I went ahead and wrote the patch to move this to a configurable option. It appears to work testing with the mbox with 3 ham messages attached previously. "Learned tokens from 3 message(s) (3 message(s) examined)" Thoughts? KAM
(In reply to comment #22) > (In reply to comment #20) > > (In reply to comment #19) > > > Upps there was a small copy error. There must be a ^ before "From " > > > > > > /^From \S+ ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d > > > (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d > > > [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [ > > > 0-2]\d:\d\d:\d\d \d{4})/ ) > > > > This regex scares me because of the localization issue. For example, Lun > > for Monday in Spanish. > > Agreed. > > How about: > > /^From \S+ ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2} > \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [ > 0-2]\d:\d\d:\d\d \d{4})/ > > I'm assuming [:upper:] and [:lower:] will match accented characters > properly. I haven't tested that assumption. Are you sure that we have a localization issue in the header fields? I use the German versions of Outlook Express, Thunderbird, Kmail and Evolution. The time stamps in the header are not localized.
It's getting worse with the kmail folks. The sometimes put to From_ lines in the mbox file. From thomas@arend-rhb.de Sat, 19 May 2012 00:10:44 +0200 From thomas@arend-rhb.de Sat May 19 00: 16:57 2012 [..] In teh second line you can see that they spend an extra space in case the hour would extend to more than 99 min in the future. I will report this to kmail. Have a nice Sunday!
(In reply to comment #22) > > /^From \S+ ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2} > \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [ > 0-2]\d:\d\d:\d\d \d{4})/ I doin't see the need of the sequence "(?:," in the regex. For me it works with "(,"
Created attachment 5071 [details] Test mbox showing two From_ lines This is an export of four messages which show two from lines.
(In reply to comment #27) > Created attachment 5071 [details] > Test mbox showing two From_ lines > > This is an export of four messages which show two from lines. Just found that the corrupted From_ line was added when filtering through formail. Don`t ask me why. Formail works on command line as expected.
(In reply to comment #26) > (In reply to comment #22) > > > /^From \S+ ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2} > > \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [ > > 0-2]\d:\d\d:\d\d \d{4})/ > > I doin't see the need of the sequence "(?:," in the regex. For me it works > with "(," There is no reason to capture the results of the match for later use. (?:...) is a non-capturing match, which is slightly more efficient. To allow for variations in whitespace, perhaps: /^From\s{1,5}\S+\s{1,5}[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2} \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/
(In reply to comment #29) > (In reply to comment #26) > > (In reply to comment #22) > > > > > /^From \S+ ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2} > > > \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [ > > > 0-2]\d:\d\d:\d\d \d{4})/ > > > > I doin't see the need of the sequence "(?:," in the regex. For me it works > > with "(," > > There is no reason to capture the results of the match for later use. > (?:...) is a non-capturing match, which is slightly more efficient. > > To allow for variations in whitespace, perhaps: > > /^From\s{1,5}\S+\s{1,5}[[:upper:]][[:lower:]]{2}(?:, \d\d > [[:upper:]][[:lower:]]{2} \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| > [[:upper:]][[:lower:]]{2} [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/ Well more importantly than specific kmail 2 mbox regexes , does the patch I wrote that let's someone use /^Mickey Mouse$/ as their mbox separator regular expression work? Regards, KAM
(In reply to comment #29) > To allow for variations in whitespace, perhaps: > > /^From\s{1,5}\S+\s{1,5}[[:upper:]][[:lower:]]{2}(?:, \d\d > [[:upper:]][[:lower:]]{2} \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| > [[:upper:]][[:lower:]]{2} [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/ I never saw more than one whitespace after the "From" but one or two white spaces after the "e-mail" address / before the weekday.
Patch committed. svn commit -m 'mbox_format_from_regex option addition bug 6703' sa-learn.raw lib/Mail/SpamAssassin/ArchiveIterator.pm lib/Mail/SpamAssassin/Conf.pm Sending lib/Mail/SpamAssassin/ArchiveIterator.pm Sending lib/Mail/SpamAssassin/Conf.pm Sending sa-learn.raw Transmitting file data ... Committed revision 1350839.
> Patch committed. This doesn't work as intended, sa-learn now rejects most messages as oversized. The problem is that mbox_format_from_regex in Conf.pm defaults to an empty string, but _set_default_message_selection_opts() tests for undef, so the regexp in $self->{opt_from_regex} remains an empty string, which according to some bizarre Perl brain damage defaults to a last regexp that matched, so the ArchiveIterator keeps finding huge messages. The fix is trivial, just removing the default => '' from the Conf.pm entry on a config option 'mbox_format_from_regex'. Also I think the /$self->{opt_from_regex}/ in sub _run_mailbox could deserve an /o flag, as I don't think its value can change once it has been set and was used for the first time. Btw, while investigating why sa-learn broke I ended up in letting ArchiveIterator and sa-learn be able to take a message size limit option. Will open a separate ticket for this. Sending ArchiveIterator.pm Sending Conf.pmT Committed revision 1354072.
Thanks, Mark. That would have driven me up a wall!
(In reply to comment #34) > Thanks, Mark. That would have driven me up a wall! Does mbox_format_from_regex break anything with mass-check? From John Hardin: When I run masschecks locally against an up-to-date repo, it is not setting the message boundary RE properly end gets scads of uninitialized variable errors trying to parse the corpus mailbox files. Last I looked, I added some warn() output and it was setting the default RE properly but then appeared to be resetting it later somewhere.
mass_checks child process: mass_checks::generate_queue -> ArchiveIterator::_scan_targets -> ArchiveIterator::_set_default_message_selection_opts (sets opt_from_regex for a child process) mass_checks parent process: waits for child; mass_checks::run_post_scan -> run_through_messages --> $iter->_run_message (runs with an undefined opt_from_regex)
(In reply to comment #36) > mass_checks child process: > mass_checks::generate_queue > -> ArchiveIterator::_scan_targets > -> ArchiveIterator::_set_default_message_selection_opts > (sets opt_from_regex for a child process) > > mass_checks parent process: > waits for child; > mass_checks::run_post_scan > -> run_through_messages > --> $iter->_run_message (runs with an undefined opt_from_regex) This is also happening to Marc Andre Selig. I think we need to revert the patches related to this, see if mbox numbers improve and then figure out a masscheck safe way to implement the fixes. That's commits 1350839 & 1354072 however fundamentally, I thought I set a sane default for the option and this is confusing me a lot.
(In reply to comment #37) > (In reply to comment #36) > > mass_checks child process: > > mass_checks::generate_queue > > -> ArchiveIterator::_scan_targets > > -> ArchiveIterator::_set_default_message_selection_opts > > (sets opt_from_regex for a child process) > > > > mass_checks parent process: > > waits for child; > > mass_checks::run_post_scan > > -> run_through_messages > > --> $iter->_run_message (runs with an undefined opt_from_regex) > > > This is also happening to Marc Andre Selig. > > I think we need to revert the patches related to this, see if mbox numbers > improve and then figure out a masscheck safe way to implement the fixes. > > That's commits 1350839 & 1354072 however fundamentally, I thought I set a > sane default for the option and this is confusing me a lot. The issues with masscheck and archiveiterator are related and at the same time unrelated. IMO, we are improperly using the ArchiteIterator class without calling _set_default_message_selection_opts which is where I set the sanity check for the opt_from_regex. I think this should be called from new when the class is constructed rather than rely on someone using the API to call it. However, bowing to existing structure, I have added a call to _set_default_message_selection_opts in mass-check.sh. I've checked and this change appears to be the fix. Index: masses/mass-check =================================================================== --- masses/mass-check (revision 1373195) +++ masses/mass-check (working copy) @@ -419,6 +419,9 @@ $iter->set_functions(\&wanted, \&result); } +#AFTER SETTING ALL OPTS, RUN _set_default_message_selection_opts() TO GET SANE DEFAULTS +$iter->_set_default_message_selection_opts(); + my $messages; # normal mode as well as a server do scan mode and get a temp file svn commit -m 'change to mass-check to get ArchiteIterator sane defaults like the from separator' Sending lib/Mail/SpamAssassin/ArchiveIterator.pm Sending masses/mass-check Transmitting file data .. Committed revision 1373202.