Bug 6703 - sa-learn doesn't work with kmail 2 mbox format
sa-learn doesn't work with kmail 2 mbox format
Status: RESOLVED FIXED
Product: Spamassassin
Classification: Unclassified
Component: Learner
3.3.2
PC Linux
: P2 normal
: 3.4.0
Assigned To: SpamAssassin Developer Mailing List
:
Depends on:
Blocks:
  Show dependency tree
 
Reported: 2011-11-16 17:30 UTC by Dmitry Lyanguzov
Modified: 2012-08-15 04:04 UTC (History)
6 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
Message saved from kmail Version 4.8.3 application/mbox None Thomas Arend [NoCLA]
Message saved from kmail Version 4.8.3 application/mbox None Thomas Arend [NoCLA]
Example with 3 true positive spam messages application/mbox None Thomas Arend [NoCLA]
Example with 3 true negative (ham) messages application/mbox None Thomas Arend [NoCLA]
Patch to add options for defining the ArchiveIterator From regex as a Conf file option patch None Kevin A. McGrail [HasCLA]
Test mbox showing two From_ lines application/mbox None Thomas Arend [NoCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Dmitry Lyanguzov 2011-11-16 17:30:00 UTC
On 3.3.2 version command sa-learn from maibox I have next on any input mailboxes:
> sa-learn --spam --mbox spam-2011-10-16.mbox
> Learned tokens from 0 message(s) (0 message(s) examined)

After downgrade to v 3.2.5 on same files I have next output:

> sa-learn --spam --mbox spam-2011-10-16.mbox
> Learned tokens from 0 message(s) (3303 message(s) examined)

learning from splitted messages does work correctly.
Comment 2 Thomas Arend 2012-05-15 21:23:18 UTC
When saving messages with kmail 1 the From Line has following format which is not iaw RFC 822:

  From info@ende-18-06.com Fri Jun 17 16:03:07 2011

which is not iaw RFC 822 but will be parsed by sa-learn reporting:
  
  Learned tokens from 1 message(s) (1 message(s) examined

With kmail 2 the format is changed to the format which is iaw RFC 822

  From thomas@arend-rhb.de Tue, 15 May 2012 22:01:41 +0200

which is not parsed correctly by sa-learn. sa-learn --spam reports: 

  Learned tokens from 0 message(s) (0 message(s) examined)


Proposed to change the behavior in a way that the old malformed From lines and the new correct ones are parsed.
Comment 3 Thomas Arend 2012-05-15 21:24:23 UTC
Created attachment 5066 [details]
Message saved from kmail Version 4.8.3

Message with new From line format
Comment 4 Thomas Arend 2012-05-15 21:25:30 UTC
Created attachment 5067 [details]
Message saved from kmail Version 4.8.3

Message with new From line format iaw RFC 822
Comment 5 Thomas Arend 2012-05-15 21:27:26 UTC
Comment on attachment 5067 [details]
Message saved from kmail Version 4.8.3

Sorry, this was saved with KMail: 1.13.6
Comment 6 Mark Martinec 2012-05-16 13:42:01 UTC
> When saving messages with kmail 1 the From Line has following format
> which is not iaw RFC 822:
>   From info@ende-18-06.com Fri Jun 17 16:03:07 2011
> With kmail 2 the format is changed to the format which is iaw RFC 822
>   From thomas@arend-rhb.de Tue, 15 May 2012 22:01:41 +0200
> which is not parsed correctly by sa-learn. sa-learn --spam reports: [...]

Oh, no, not yet another incompatible mbox format!!!

> Proposed to change the behavior in a way that the old malformed From lines
> and the new correct ones are parsed.

It is the other way around, the new one differs from everybody else.

The format of the mbox file (along with its separator From_ lines)
is *not* governed by RFC 822 or its successors. There is no formal
standard for an mbox format, the RFC 4155 comes closest:
  http://tools.ietf.org/html/rfc4155
See also a Wikipedia article:
  http://en.wikipedia.org/wiki/Mbox


RFC 4155 says:

| a timestamp indicating the UTC date and time when the message
| was originally received, conformant with the syntax of the
| traditional UNIX 'ctime' output sans timezone (note that the
| use of UTC precludes the need for a timezone indicator);

This matches qmail docs:
  http://qmail.org/qmail-manual-html/man5/mbox.html
and matches Postfix and sendmail's local delivery agent.


To accommodate the new incompatible format it seems that the
two instances of a regexps in ArchiveIterator.pm need to be
extended, or just relaxed. Not sure if the date would still
be correctly parsed.

Best would be to persuade kmail folks to back off the change!
Comment 7 Kevin A. McGrail 2012-05-16 14:02:52 UTC
> To accommodate the new incompatible format it seems that the
> two instances of a regexps in ArchiveIterator.pm need to be
> extended, or just relaxed. Not sure if the date would still
> be correctly parsed.
> 
> Best would be to persuade kmail folks to back off the change!

Agreeing with Mark.  There is no standard mbox format.  

However, I can't find the ticket but I made some small tweaks recently-ish to trunk to help another mbox format issue.  I wonder if trunk can parse this new format?

1 - Can you attach a file with 3 mbox format files that aren't read

2 - Perhaps we need a relaxed option in ArchiveIterator that uses a less stringent From line test or even a way to specify a regex to use in a cf?  

Regards,
KAM
Comment 8 Mark Martinec 2012-05-16 14:10:08 UTC
> 1 - Can you attach a file with 3 mbox format files that aren't read

Two are already attached, one old and one new.
I tried it, and the trunk can't parse the new one.
 
> 2 - Perhaps we need a relaxed option in ArchiveIterator that uses a less
> stringent From line test or even a way to specify a regex to use in a cf?  

I thought of that. The From line test is really quite stringent.

The concern is that one could encounter an mbox file in an mboxcl or mboxcl2
format (see the Wikipedia article on mbox), where a From may not even
be protected by a '>', but parsing relies on a Content-Length header
field. I believe this is also being adopted by Dovecot.
Comment 9 Kevin A. McGrail 2012-05-16 14:14:02 UTC
(In reply to comment #8)
> > 1 - Can you attach a file with 3 mbox format files that aren't read
> 
> Two are already attached, one old and one new.
> I tried it, and the trunk can't parse the new one.

I meant a single mbox file with the new format with 3 emails in it.

> > 2 - Perhaps we need a relaxed option in ArchiveIterator that uses a less
> > stringent From line test or even a way to specify a regex to use in a cf?  
> 
> I thought of that. The From line test is really quite stringent.

Agreed.

> The concern is that one could encounter an mbox file in an mboxcl or mboxcl2
> format (see the Wikipedia article on mbox), where a From may not even
> be protected by a '>', but parsing relies on a Content-Length header
> field. I believe this is also being adopted by Dovecot.

Well at least mboxcl is a different format and we'll have to cross that bridge if it looks like it will be adopted.
Comment 10 Thomas Arend 2012-05-16 18:08:15 UTC
Created attachment 5068 [details]
Example with 3 true positive spam messages

This is am example for sa-learn with 3 true positive spam mails.
Comment 11 Thomas Arend 2012-05-16 18:10:12 UTC
Created attachment 5069 [details]
Example with 3 true negative (ham) messages

This is am mbox example file with three true negative (ham) messages.

"sa-learn --ham --mbox" reports
Learned tokens from 0 message(s) (0 message(s) examined)
Comment 12 Dmitry Lyanguzov 2012-05-16 18:46:48 UTC
So, what you think about that version 3.2.5 is working fine with same mailboxes on same perl version?
Comment 13 Thomas Arend 2012-05-16 19:10:31 UTC
A first quick and dirty solution to work around this problem was encouraging.

I converted the kmail2 mbox with the following command to the older version:

sed '/^From / s#\(...\), \([0-3][0-9]\) \(...\) 2012 \([0-2][0-9]:[0-5][0-9]:[0-5][0-9]\) [+-][0-9]\{4\}#\1 \3 \2 \4 2012#' <spam-old.mbox >spam-new.mbox

sa-learn reported:
Learned tokens from 1489 message(s) (1733 message(s) examined)

Which may be ok some messages may be learned earlier.

A trailing "0" in the day - which may not comply to ctime format - does not effect the parsing of the mbox file.

This work-araound will help me over the next days and weeks. But learning in kmail2 may not work or will need a modification.

I understand that there are different definitions for the date time in the "From_" line. What a mess!

I agree that the best would be to persuade the kmail folks to back off the change. You may refer to the following bug report: 297198 

https://bugs.kde.org/show_bug.cgi?id=297198
Comment 14 Kevin A. McGrail 2012-05-16 19:22:26 UTC
(In reply to comment #13)
> A first quick and dirty solution to work around this problem was encouraging.
> 
> I converted the kmail2 mbox with the following command to the older version:
> 
> sed '/^From / s#\(...\), \([0-3][0-9]\) \(...\) 2012
> \([0-2][0-9]:[0-5][0-9]:[0-5][0-9]\) [+-][0-9]\{4\}#\1 \3 \2 \4 2012#'
> <spam-old.mbox >spam-new.mbox
> 
> sa-learn reported:
> Learned tokens from 1489 message(s) (1733 message(s) examined)
> 
> Which may be ok some messages may be learned earlier.
> 
> A trailing "0" in the day - which may not comply to ctime format - does not
> effect the parsing of the mbox file.
> 
> This work-araound will help me over the next days and weeks. But learning in
> kmail2 may not work or will need a modification.
> 
> I understand that there are different definitions for the date time in the
> "From_" line. What a mess!
> 
> I agree that the best would be to persuade the kmail folks to back off the
> change. You may refer to the following bug report: 297198 
> 
> https://bugs.kde.org/show_bug.cgi?id=297198

Very interesting idea to use sed.  You should be able to pipe that directly to sa-learn as well instead of even using a temporary file.  That's a good temporary fix.
Comment 15 Thomas Arend 2012-05-16 19:43:49 UTC
>Very interesting idea to use sed.  You should be able to pipe that directly to >sa-learn as well instead of even using a temporary file.  That's a good >temporary fix.

Thanks,

I did not time zone adjustment. Would be a bit complicate for sed. I hope this will not effect the spam tokens to much.
Comment 16 Mark Martinec 2012-05-16 20:31:44 UTC
> I agree that the best would be to persuade the kmail folks to back off the
> change. You may refer to the following bug report: 297198 
> https://bugs.kde.org/show_bug.cgi?id=297198

Great, thanks. Could you please add contents of Comment 6
(possibly edited if you wish) to that kde ticket,
and/or a ref to this PR:
  http://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703
Comment 17 Thomas Arend 2012-05-16 21:13:41 UTC
(In reply to comment #16)
> > I agree that the best would be to persuade the kmail folks to back off the
> > change. You may refer to the following bug report: 297198 
> > https://bugs.kde.org/show_bug.cgi?id=297198
> 
> Great, thanks. Could you please add contents of Comment 6
> (possibly edited if you wish) to that kde ticket,
> and/or a ref to this PR:
>   http://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703

Have added your comment to the KDE bug report and added a cross reference to this report.
Comment 18 Thomas Arend 2012-05-17 19:39:53 UTC
Thanks to the hint to ArchiveIterator.pm I found two regex for checking the From_ line in Mail::SpamAssassin::ArchiveIterator and replaced them with following regex: (I am running Version 3.3.1)
 
/From \S+  ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/ )

This replacement works fine for me and it is compatible with both mbox date formats. My pattern checks the date stricter than the old pattern which may be an advantage or an disadvantage.

diff (orig) (new)
399c399
<     last if (substr($_,0,5) eq "From " && @msg && /^From \S+  ?\S\S\S \S\S\S .\d .\d:\d\d:\d\d \d{4}/);
---
>     last if (substr($_,0,5) eq "From " && @msg && /^From \S+  ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/ );
912c912,913
<             /^From \S+  ?\S\S\S \S\S\S .\d .\d:\d\d:\d\d \d{4}/) {
---
>             /From \S+  ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/ ) {
Comment 19 Thomas Arend 2012-05-17 19:50:55 UTC
Upps there was a small copy error. There must be a ^ before "From "

/^From \S+  ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/ )
Comment 20 Kevin A. McGrail 2012-05-17 20:24:10 UTC
(In reply to comment #19)
> Upps there was a small copy error. There must be a ^ before "From "
> 
> /^From \S+  ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d
> (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d
> [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [
> 0-2]\d:\d\d:\d\d \d{4})/ )

This regex scares me because of the localization issue.  For example, Lun for Monday in Spanish.
Comment 21 Kevin A. McGrail 2012-05-17 20:26:39 UTC
(In reply to comment #7)
> However, I can't find the ticket but I made some small tweaks recently-ish
> to trunk to help another mbox format issue.  I wonder if trunk can parse
> this new format?

Bug 6413 and no, that was for communigate not kmail.  Not relevant except it touches on the same non-standardization for mbox formats.
Comment 22 John Hardin 2012-05-17 21:38:04 UTC
(In reply to comment #20)
> (In reply to comment #19)
> > Upps there was a small copy error. There must be a ^ before "From "
> > 
> > /^From \S+  ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d
> > (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d
> > [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [
> > 0-2]\d:\d\d:\d\d \d{4})/ )
> 
> This regex scares me because of the localization issue.  For example, Lun
> for Monday in Spanish.

Agreed.

How about:

/^From \S+  ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2} \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/

I'm assuming [:upper:] and [:lower:] will match accented characters properly. I haven't tested that assumption.
Comment 23 Kevin A. McGrail 2012-05-20 02:41:08 UTC
Created attachment 5070 [details]
Patch to add options for defining the ArchiveIterator From regex as a Conf file option

> How about:
> 
> /^From \S+  ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2}
> \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [
> 0-2]\d:\d\d:\d\d \d{4})/
> 
> I'm assuming [:upper:] and [:lower:] will match accented characters
> properly. I haven't tested that assumption.

I don't know enough about foreign languages to know for sure the format is always leading caps, etc. So I went ahead and wrote the patch to move this to a configurable option.

It appears to work testing with the mbox with 3 ham messages attached previously.  

"Learned tokens from 3 message(s) (3 message(s) examined)"

Thoughts?

KAM
Comment 24 Thomas Arend 2012-05-20 11:54:45 UTC

 (In reply to comment #22)
> (In reply to comment #20)
> > (In reply to comment #19)
> > > Upps there was a small copy error. There must be a ^ before "From "
> > > 
> > > /^From \S+  ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d
> > > (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d
> > > [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [
> > > 0-2]\d:\d\d:\d\d \d{4})/ )
> > 
> > This regex scares me because of the localization issue.  For example, Lun
> > for Monday in Spanish.
> 
> Agreed.
> 
> How about:
> 
> /^From \S+  ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2}
> \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [
> 0-2]\d:\d\d:\d\d \d{4})/
> 
> I'm assuming [:upper:] and [:lower:] will match accented characters
> properly. I haven't tested that assumption.

Are you sure that we have a localization issue in the header fields? I use the German versions of Outlook Express, Thunderbird, Kmail and Evolution. The time stamps in the header are not localized.
Comment 25 Thomas Arend 2012-05-20 12:09:29 UTC
It's getting worse with the kmail folks. The sometimes put to From_ lines in the mbox file.

From thomas@arend-rhb.de Sat, 19 May 2012 00:10:44 +0200
From thomas@arend-rhb.de  Sat May 19 00: 16:57 2012
[..]

In teh second line you can see that they spend an extra space in case the hour would extend to more than 99 min in the future.

I will report this to kmail.

Have a nice Sunday!
Comment 26 Thomas Arend 2012-05-20 12:11:22 UTC
(In reply to comment #22)

> 
> /^From \S+  ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2}
> \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [
> 0-2]\d:\d\d:\d\d \d{4})/

I doin't see the need of the sequence "(?:," in the regex. For me it works with "(,"
Comment 27 Thomas Arend 2012-05-20 12:16:26 UTC
Created attachment 5071 [details]
Test mbox showing two From_ lines

This is an export of four messages which show two from lines.
Comment 28 Thomas Arend 2012-05-20 13:41:23 UTC
(In reply to comment #27)
> Created attachment 5071 [details]
> Test mbox showing two From_ lines
> 
> This is an export of four messages which show two from lines.

Just found that the corrupted From_ line was added when filtering through formail. Don`t ask me why. Formail works on command line as expected.
Comment 29 John Hardin 2012-05-20 16:54:40 UTC
(In reply to comment #26)
> (In reply to comment #22)
> 
> > /^From \S+  ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2}
> > \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [
> > 0-2]\d:\d\d:\d\d \d{4})/
> 
> I doin't see the need of the sequence "(?:," in the regex. For me it works
> with "(,"

There is no reason to capture the results of the match for later use. (?:...) is a non-capturing match, which is slightly more efficient.

To allow for variations in whitespace, perhaps:

/^From\s{1,5}\S+\s{1,5}[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2} \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/
Comment 30 Kevin A. McGrail 2012-05-20 17:09:42 UTC
(In reply to comment #29)
> (In reply to comment #26)
> > (In reply to comment #22)
> > 
> > > /^From \S+  ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2}
> > > \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [
> > > 0-2]\d:\d\d:\d\d \d{4})/
> > 
> > I doin't see the need of the sequence "(?:," in the regex. For me it works
> > with "(,"
> 
> There is no reason to capture the results of the match for later use.
> (?:...) is a non-capturing match, which is slightly more efficient.
> 
> To allow for variations in whitespace, perhaps:
> 
> /^From\s{1,5}\S+\s{1,5}[[:upper:]][[:lower:]]{2}(?:, \d\d
> [[:upper:]][[:lower:]]{2} \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}|
> [[:upper:]][[:lower:]]{2} [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/

Well more importantly than specific kmail 2 mbox regexes , does the patch I wrote that let's someone use /^Mickey Mouse$/ as their mbox separator regular expression work?  

Regards,
KAM
Comment 31 Thomas Arend 2012-05-20 19:33:35 UTC
(In reply to comment #29)
> To allow for variations in whitespace, perhaps:
> 
> /^From\s{1,5}\S+\s{1,5}[[:upper:]][[:lower:]]{2}(?:, \d\d
> [[:upper:]][[:lower:]]{2} \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}|
> [[:upper:]][[:lower:]]{2} [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/

I never saw more than one whitespace after the "From" but one or two white spaces after the "e-mail" address / before the weekday.
Comment 32 Kevin A. McGrail 2012-06-16 02:29:45 UTC
Patch committed.

svn commit -m 'mbox_format_from_regex option addition bug 6703' sa-learn.raw lib/Mail/SpamAssassin/ArchiveIterator.pm lib/Mail/SpamAssassin/Conf.pm
Sending        lib/Mail/SpamAssassin/ArchiveIterator.pm
Sending        lib/Mail/SpamAssassin/Conf.pm
Sending        sa-learn.raw
Transmitting file data ...
Committed revision 1350839.
Comment 33 Mark Martinec 2012-06-26 15:51:33 UTC
> Patch committed.

This doesn't work as intended, sa-learn now rejects most messages
as oversized.

The problem is that mbox_format_from_regex in Conf.pm defaults to
an empty string, but _set_default_message_selection_opts() tests
for undef, so the regexp in $self->{opt_from_regex} remains an
empty string, which according to some bizarre Perl brain damage
defaults to a last regexp that matched, so the ArchiveIterator
keeps finding huge messages.

The fix is trivial, just removing the  default => ''
from the Conf.pm entry on a config option 'mbox_format_from_regex'.

Also I think the /$self->{opt_from_regex}/ in sub _run_mailbox
could deserve an /o flag, as I don't think its value can change
once it has been set and was used for the first time.

Btw, while investigating why sa-learn broke I ended up in
letting ArchiveIterator and sa-learn be able to take a message
size limit option. Will open a separate ticket for this.


  Sending ArchiveIterator.pm
  Sending Conf.pmT
Committed revision 1354072.
Comment 34 Kevin A. McGrail 2012-06-26 20:34:05 UTC
Thanks, Mark.  That would have driven me up a wall!
Comment 35 Kevin A. McGrail 2012-08-08 15:22:09 UTC
(In reply to comment #34)
> Thanks, Mark.  That would have driven me up a wall!

Does mbox_format_from_regex break anything with mass-check?

From John Hardin:

When I run masschecks locally against an up-to-date repo, it is not setting the message boundary RE properly end gets scads of uninitialized variable errors trying to parse the corpus mailbox files. Last I looked, I added some warn() output and it was setting the default RE properly but then appeared to be resetting it later somewhere.
Comment 36 Mark Martinec 2012-08-09 15:11:45 UTC
mass_checks child process:
  mass_checks::generate_queue
    -> ArchiveIterator::_scan_targets
      -> ArchiveIterator::_set_default_message_selection_opts
              (sets opt_from_regex for a child process)

mass_checks parent process:
  waits for child;
  mass_checks::run_post_scan
  -> run_through_messages
    --> $iter->_run_message  (runs with an undefined opt_from_regex)
Comment 37 Kevin A. McGrail 2012-08-13 17:13:50 UTC
(In reply to comment #36)
> mass_checks child process:
>   mass_checks::generate_queue
>     -> ArchiveIterator::_scan_targets
>       -> ArchiveIterator::_set_default_message_selection_opts
>               (sets opt_from_regex for a child process)
> 
> mass_checks parent process:
>   waits for child;
>   mass_checks::run_post_scan
>   -> run_through_messages
>     --> $iter->_run_message  (runs with an undefined opt_from_regex)


This is also happening to Marc Andre Selig.

I think we need to revert the patches related to this, see if mbox numbers improve and then figure out a masscheck safe way to implement the fixes.

That's commits 1350839 & 1354072 however fundamentally, I thought I set a sane default for the option and this is confusing me a lot.
Comment 38 Kevin A. McGrail 2012-08-15 04:04:48 UTC
(In reply to comment #37)
> (In reply to comment #36)
> > mass_checks child process:
> >   mass_checks::generate_queue
> >     -> ArchiveIterator::_scan_targets
> >       -> ArchiveIterator::_set_default_message_selection_opts
> >               (sets opt_from_regex for a child process)
> > 
> > mass_checks parent process:
> >   waits for child;
> >   mass_checks::run_post_scan
> >   -> run_through_messages
> >     --> $iter->_run_message  (runs with an undefined opt_from_regex)
> 
> 
> This is also happening to Marc Andre Selig.
> 
> I think we need to revert the patches related to this, see if mbox numbers
> improve and then figure out a masscheck safe way to implement the fixes.
> 
> That's commits 1350839 & 1354072 however fundamentally, I thought I set a
> sane default for the option and this is confusing me a lot.

The issues with masscheck and archiveiterator are related and at the same time unrelated.

IMO, we are improperly using the ArchiteIterator class without calling _set_default_message_selection_opts which is where I set the sanity check for the opt_from_regex.

I think this should be called from new when the class is constructed rather than rely on someone using the API to call it.  However, bowing to existing structure, I have added a call to _set_default_message_selection_opts in mass-check.sh.

I've checked and this change appears to be the fix.

Index: masses/mass-check
===================================================================
--- masses/mass-check   (revision 1373195)
+++ masses/mass-check   (working copy)
@@ -419,6 +419,9 @@
   $iter->set_functions(\&wanted, \&result);
 }
 
+#AFTER SETTING ALL OPTS, RUN _set_default_message_selection_opts() TO GET SANE DEFAULTS
+$iter->_set_default_message_selection_opts();
+
 my $messages;
 
 # normal mode as well as a server do scan mode and get a temp file


svn commit -m 'change to mass-check to get ArchiteIterator sane defaults like the from separator'
Sending        lib/Mail/SpamAssassin/ArchiveIterator.pm
Sending        masses/mass-check
Transmitting file data ..
Committed revision 1373202.