Bug 6127 - exited on signal 11
Summary: exited on signal 11
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamc/spamd (show other bugs)
Version: 3.2.5
Hardware: PC FreeBSD
: P5 critical
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-06-08 17:21 UTC by Joel
Modified: 2013-01-20 17:24 UTC (History)
2 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
spamd log application/x-zip-compressed None Joel [NoCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Joel 2009-06-08 17:21:14 UTC
I reviewed what I could find on this problem and it seems similar to a bug reported fixed.

I am getting log entries similar to:

/var/log/messages:

Jun  8 17:11:35 tahoestores kernel: pid 1214 (perl5.8.9), uid 58: exited on signal 11


/var/log/maillog:

Jun  8 17:11:35 tahoestores spamd[971]: spamd: handled cleanup of child pid 1214 due to SIGCHLD
Jun  8 17:11:35 tahoestores spamd[971]: prefork: child states: IB
Jun  8 17:11:35 tahoestores qmail-scanner-queue.pl: qmail-scanner[21168]: Clear:RC:0(85.216.253.118):SA:0(?/?): 7.370024 2234 Ari-sussie@barkershoe.co.uk fiberglassb53@atahoerose.com For:_fiberglassb53@atahoerose.com <124450628776421168@tahoestores.org> 1244506288.21176-0.tahoestores.org:1824 orig-tahoestores.org124450628754021168:2234

The returned status is (?/?) and thus the email gets delivered.

I've tried rebuilding everything I can think off, gone to --round-robin, and tried setting the -D flag.

This is a FreeBSD 6.2 system running perl5.8.9. There are no indications of hardware or memory issues.

Suggestions?

Thanks
Comment 1 Mark Martinec 2009-06-08 17:58:40 UTC
That sounds similar to Bug 6060, although in my case the crash was a SIGBUS
and not a SIGSEGV. Try first to obtain and attach a full debugging output,
to be able to see at which point it crashes. Also, the 3.1.8 rather old, it
would make sense to first try with the version currently in FreeBSD ports.
Comment 2 Joel 2009-06-08 21:31:50 UTC
(In reply to comment #1)
> That sounds similar to Bug 6060, although in my case the crash was a SIGBUS
> and not a SIGSEGV. Try first to obtain and attach a full debugging output,
> to be able to see at which point it crashes. Also, the 3.1.8 rather old, it
> would make sense to first try with the version currently in FreeBSD ports.

In that the problem became serious after I upgraded to perl5.8.9, yes it does sound similar. I am running ports and p5-Mail-SpamAssassin-3.2.5_4. Everything is up to date. 

I am not sure why spamd reports reports 3.1.8.

What is the best way to obtain a full debugging output? Debugging is set to true in qmail-scanner-queue.pl. The log qmail-queue.log does not contain any additional information. Is there some docs on this topic?

Thanks
Joel
Comment 3 Karsten Bräckelmann 2009-06-09 06:11:19 UTC
> I am not sure why spamd reports reports 3.1.8.

We don't know if it does, you tell us. :)

Unless my under-caffeinated brain is playing tricks on my, there is no version in your original comment 0 anywhere. However, you set the version field to 3.1.8 yourself. If that's not the case, please correct it.

$ spamassassin --version
Comment 4 Mark Martinec 2009-06-09 11:25:45 UTC
> What is the best way to obtain a full debugging output? Debugging is set to
> true in qmail-scanner-queue.pl. The log qmail-queue.log does not contain any
> additional information. Is there some docs on this topic?

This seems to be a question specific to qmail-scanner, you may get a
better answer about logging there. As far as SpamAssassin is concerned,
its logging is controlled by an option 'debug' as passed in a hash of
options during a call to Mail::SpamAssassin::new, such as:
  my($sa_obj) = Mail::SpamAssassin->new({debug=>'all', ...});
See man page: 'man Mail::SpamAssassin'.

Debugging goes normally to stderr, but can be captured by a caller
by supplying a logging module name in a call to:
  Mail::SpamAssassin::Logger::add(method => ..., ))
I'm sorry but I have no idea how qmail-scanner controls the logging.

Btw, since you are on FreeBSD where there is no major drama to manually
install perl modules, and later revert back to re-installing them from
ports if necessary, you may want to try it with SpamAssassin 3.3 from
CVS or a tarball. It should still be upwards and downwards compatible
with 3.2.5 in practically all aspects. To make it compatible with
FreeBSD ports version, two soft links should be added to make life
more comfortable:
  ln -s /usr/local/etc/mail/spamassassin /etc/mail/spamassassin
  ln -s /var/db/spamassassin /var/lib/spamassassin
The installation is as usual:
  perl Makefile.PL; make; make install
and then not to forget to run sa-update.

To revert back to SpamAssassin 3.2.5 from ports:
  pkg_delete -f bsdpan-Mail-SpamAssassin-3.003000
  cd /usr/ports/mail/p5-Mail-SpamAssassin/
  make; make deinstall; make reinstall; make clean
Comment 5 Joel 2009-06-09 11:35:38 UTC
(In reply to comment #3)
> > I am not sure why spamd reports reports 3.1.8.
> We don't know if it does, you tell us. :)
> Unless my under-caffeinated brain is playing tricks on my, there is no version
> in your original comment 0 anywhere. However, you set the version field to
> 3.1.8 yourself. If that's not the case, please correct it.
> $ spamassassin --version

In my mail header:

Received: from 140.211.11.140 by tahoestores.org (envelope-from <bugzilla-daemon@bugzilla.spamassassin.org>, uid 1002) with qmail-scanner-2.01 
 (clamdscan: 0.90.3/3598. spamassassin: 3.1.8.  <<<<-----------------------????
 Clear:RC:0(140.211.11.140):SA:0(-3.8/2.5):. 
 Processed in 8.132351 secs); 09 Jun 2009 01:16:59 -0000
X-Spam-Status: No, score=-3.8 required=2.5
X-Spam-Level: 

From the command line:

tahoestores#  spamassassin --version
SpamAssassin version 3.2.5
  running on Perl version 5.8.9

A qmail-scanner issue?

I have updated the version for the bug report.
Comment 6 Mark Martinec 2009-06-09 11:47:19 UTC
> In my mail header:
> 
> Received: from 140.211.11.140 by tahoestores.org (envelope-from
> <bugzilla-daemon@bugzilla.spamassassin.org>, uid 1002) with qmail-scanner-2.01 
>  (clamdscan: 0.90.3/3598. spamassassin: 3.1.8.  <<<<-----------------------????
>  Clear:RC:0(140.211.11.140):SA:0(-3.8/2.5):. 
>  Processed in 8.132351 secs); 09 Jun 2009 01:16:59 -0000
> X-Spam-Status: No, score=-3.8 required=2.5
> X-Spam-Level: 
> 
> From the command line:
> 
> tahoestores#  spamassassin --version
> SpamAssassin version 3.2.5
>   running on Perl version 5.8.9
> 
> A qmail-scanner issue?

No idea. You may have two versions of SpamAssassin installed
and qmail-scanner picking up the old one. Just guessing.
Comment 7 Karsten Bräckelmann 2009-06-09 15:37:45 UTC
Maybe a qmail scanner issue, but definitely an issue with your SA installations. Setting Version back to 3.1.8, according to comment 5 the code in question obviously is not 3.2.

Regarding comment 0, was that pid 1214 actually a spamd child process? Just making sure, cause you didn't paste anything logged from the child.

Anyway, that's a segmentation fault. I believe that would be a Perl issue, wouldn't it?
Comment 8 Joel 2009-06-09 19:02:35 UTC
(In reply to comment #7)
> Maybe a qmail scanner issue, but definitely an issue with your SA
> installations. Setting Version back to 3.1.8, according to comment 5 the code
> in question obviously is not 3.2.

Yes, after some poking around this is a qmail-scanner issue. The script attempts to record the version numbers in a file but this file never gets updated and hasn't been updated since the original installation nearly two years ago. I've been modifying the permissions to no avail.

->
      #X-Spam-Checker-Version: SpamAssassin 2.01
      open(SPAS,"$spamassassin_binary  -V |")||die "failed to call $spamassassin_binary -V - $\
!";
      $spamassassin_eng="2.x";
      while (<SPAS>) {
        chomp;
        if (/^SpamAssassin version (.*)$/i) {
          $spamassassin_eng=$1;
        }
      }
      close(SPAS);
      $SCANINFO .= "spamassassin: $spamassassin_eng. ";
    } else {
      #Catch-all for other ones
      $SCANINFO .= "$scanner: ???. ";
    }
  }
  $SCANINFO =~ s/ \. / /g;
  open(VER,">$versionfile.tmp")||die "cannot write to $versionfile.tmp - $!";
  print VER $SCANINFO;
  close(VER);
  rename("$versionfile.tmp","$versionfile");

-->

I verified that there is but one spamassassin executable on the system and that it is version 3.2.5. In fact, there is only one copy of every process involved.

> Regarding comment 0, was that pid 1214 actually a spamd child process? Just
> making sure, cause you didn't paste anything logged from the child.

The pid 1214 belongs to spamd as revealed by it's uid which is 58 on my system.

> Anyway, that's a segmentation fault. I believe that would be a Perl issue,
> wouldn't it?

Yes, it is obviously a segmentation fault that occurs in Perl. But your whole gig is written in perl and depends intimately on perl. So I believe it would be hard to wash your hands of the issue. Bug 6060 seems to make this clear.

Finding the errant piece of code is my problem.
Comment 9 Karsten Bräckelmann 2009-06-09 19:41:12 UTC
Can you reproduce the crash? Does it always happen with that mail? If so, please strip down the sample, masking or removing any confidential data and attach the sample to this bug report. Also, please change the Component to Security.

> Yes, it is obviously a segmentation fault that occurs in Perl. But your whole
> gig is written in perl and depends intimately on perl. So I believe it would be
> hard to wash your hands of the issue. Bug 6060 seems to make this clear.

I'm not saying we shouldn't work around a Perl issue, if we can. All I'm saying is, that ultimately a segmentation fault *is* a bug in Perl and needs to be fixed in Perl properly. Accessing memory at will isn't exactly something you can do in Perl...
Comment 10 Joel 2009-06-09 20:01:48 UTC
(In reply to comment #9)
> Can you reproduce the crash? Does it always happen with that mail? If so,
> please strip down the sample, masking or removing any confidential data and
> attach the sample to this bug report. Also, please change the Component to
> Security.

Thanks. The problem is occuring on a production system that processes several thousand emails per day. I hadn't expected to be spending several days looking into this problem, nor had I envisioned setting up a test platform. I am getting several hundred failures per day. 

> > Yes, it is obviously a segmentation fault that occurs in Perl. But your whole
> > gig is written in perl and depends intimately on perl. So I believe it would be
> > hard to wash your hands of the issue. Bug 6060 seems to make this clear.
> I'm not saying we shouldn't work around a Perl issue, if we can. All I'm saying
> is, that ultimately a segmentation fault *is* a bug in Perl and needs to be
> fixed in Perl properly. Accessing memory at will isn't exactly something you
> can do in Perl...

I agree, but I am also sure that I can write a piece of perl that causes a segmentation fault and that would not neccessary indicate a perl bug.

What I would like to do is continue on with comment #4 so that we can identify where the crash is occuring.
Comment 11 Joel 2009-06-09 20:05:04 UTC
(In reply to comment #4)
> > What is the best way to obtain a full debugging output? Debugging is set to
> > true in qmail-scanner-queue.pl. The log qmail-queue.log does not contain any
> > additional information. Is there some docs on this topic?
> This seems to be a question specific to qmail-scanner, you may get a
> better answer about logging there. As far as SpamAssassin is concerned,
> its logging is controlled by an option 'debug' as passed in a hash of
> options during a call to Mail::SpamAssassin::new, such as:
>   my($sa_obj) = Mail::SpamAssassin->new({debug=>'all', ...});
> See man page: 'man Mail::SpamAssassin'.
> Debugging goes normally to stderr, but can be captured by a caller
> by supplying a logging module name in a call to:
>   Mail::SpamAssassin::Logger::add(method => ..., ))

Thanks. Do you have a sample logging module? I am sure I can come up with one but I am running a production system and it would be helpful to have a sample from the experts to minimize problems.
Comment 12 Mark Martinec 2009-06-10 02:38:52 UTC
> Thanks. Do you have a sample logging module? I am sure I can come up with one
> but I am running a production system and it would be helpful to have a sample
> from the experts to minimize problems.

Are we talking about sub-spamassassin.pl in qmail-scanner?
It is not calling SpamAssassin directly, but just forking
a spamc and feeding mail to it through a pipe. So as far as
SpamAssassin is concerned, this is a classical spamc/spamd setup.
All your logging is controlled by command line options on spamd,
see its man page. You may want to run it non-daemonized,
with -D or --debug=all, --syslog=stderr, perhaps with stdout
and stderr jointly redirected to some file.
Comment 13 John Hardin 2009-06-10 08:33:10 UTC
(In reply to comment #10)
> I am getting several hundred failures per day. 

Ouch.

You might want to consider reinstalling the perl executable, it might have become corrupted somehow. That should at least be a troubleshooting step at some point.

I'm not familiar with BSD package management; is there an integrity checker for installed packages?
Comment 14 Joel 2009-06-10 10:36:14 UTC
Created attachment 4460 [details]
spamd log

Here is a log file for a spamd process that crashed.
Comment 15 Joel 2009-06-10 10:45:49 UTC
(In reply to comment #13)
> (In reply to comment #10)
> > I am getting several hundred failures per day. 
> Ouch.
> You might want to consider reinstalling the perl executable, it might have
> become corrupted somehow. That should at least be a troubleshooting step at
> some point.
> I'm not familiar with BSD package management; is there an integrity checker for
> installed packages?

A new port for perl-5.8.9_3 was available and installed yesterday. The tar.bz2 files are checked with MD5 & SHA256 checksums. I also ran 'make test'.

Failed 3 tests out of 1111, 99.73% okay.
        ../lib/Net/Ping/t/450_service.t
        ../lib/Net/Ping/t/510_ping_udp.t
        op/alarm.t
Comment 16 Joel 2009-06-10 10:58:40 UTC
(In reply to comment #12)
> > Thanks. Do you have a sample logging module? I am sure I can come up with one
> > but I am running a production system and it would be helpful to have a sample
> > from the experts to minimize problems.
> Are we talking about sub-spamassassin.pl in qmail-scanner?
> It is not calling SpamAssassin directly, but just forking
> a spamc and feeding mail to it through a pipe. So as far as
> SpamAssassin is concerned, this is a classical spamc/spamd setup.
> All your logging is controlled by command line options on spamd,
> see its man page. You may want to run it non-daemonized,
> with -D or --debug=all, --syslog=stderr, perhaps with stdout
> and stderr jointly redirected to some file.

I experimented with sa-spamd last night. I had a bit of trouble getting the results specified in the man page, but this morning I did notice that spamd was logging to /var/log/debug.log.

# Set defaults
: ${spamd_enable:="NO"}
: ${spamd_flags="-c -Q --debug all -s /var/log/spamd.log"}
: ${spamd_nice="5"}

My review of the logs revealed Bug 6122, https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6122
which I fixed by commenting out the line as indicated.

I also noticed something akin to Bug 497898 where spamd complains that pyzor is not installed, yet it is.

I uploaded a zip file from debug.log detailing all log entries for a pid that crashed. This occured after I commented out the offending line in DKIM.
Comment 17 Karsten Bräckelmann 2009-06-10 11:18:29 UTC
AWL, DB_File backend. The last few lines:

spamd[1189]: auto-whitelist: tie-ing to DB file of type DB_File R/W
 in /home/qscand/.spamassassin/auto-whitelist

spamd[1189]: auto-whitelist: AWL active, pre-score: 14.945,
 autolearn score: 14.945, mean: undef, IP: 79.177.123.207
spamd[1189]: auto-whitelist: add_score: new count: 1, new totscore: 14.945 
spamd[983]: prefork: child 1189: just exited 
spamd[983]: prefork: child 1189: entering state 4 

So what's missing is pretty much this (local run):

[15543] dbg: auto-whitelist: DB addr list: untie-ing and unlocking
[15543] dbg: auto-whitelist: DB addr list: file locked, breaking lock
[15543] dbg: auto-whitelist: post auto-whitelist score: 15.72

That's M::SA::Plugin::AWL and M::SA::DBBasedAddrList.

Joel, can you add some fine grained debugging lines to Plugin/AWL.pm, function check_from_in_auto_whitelist(), please? Specifically between the call of
 $whitelist->add_score($awlpoints);

and the finish() a couple lines down. It might be necessary to recurse into the functions called from DBBasedAddrList.pm also.

Interesting stuff to dump to dbg are variable values like $delta and the result of the sprintf call, etc. The added debugging should show how far exactly we come before the crash happens.
Comment 18 Karsten Bräckelmann 2009-06-10 11:22:06 UTC
Also, -D auto-whitelist should be sufficient rather than all, if you want to get rid of most of the debugging stuff.
Comment 19 Mark Martinec 2009-06-10 11:30:06 UTC
> I uploaded a zip file from debug.log detailing all log entries for
> a pid that crashed.

Thanks. It's a pitty that spamd does not log the exit status of
a failing process (the $? after a waitpid call).
Comment 20 Joel 2009-06-10 11:56:29 UTC
(In reply to comment #18)
> Also, -D auto-whitelist should be sufficient rather than all, if you want to
> get rid of most of the debugging stuff.

OK.

I also noticed that I have Bayes files in two different locations.

/var/spool/spamd/.spamassassin
/usr/home/qscand/.spamassassin

Both directories indicate that they are being accessed.

I don't know if this normal or whether it is a by product of the initial install 2 years ago that was done manually per QmailRocks and then I eventually converted to using the ports system?????
Comment 21 Karsten Bräckelmann 2009-06-10 12:11:32 UTC
> I also noticed that I have Bayes files in two different locations.

Would the spamd running user by chace be called 'spamd' and/or has a HOME dir set to /var/spool/spamd? Did you run spamassassin or spamc as that user for debugging recently?

> Both directories indicate that they are being accessed.

Last access / modifications time? Same as your debugging run?

Anyway, this most likely is unrelated to this bug.
Comment 22 John Hardin 2009-06-10 12:58:10 UTC
(In reply to comment #15)
> (In reply to comment #13)
> > (In reply to comment #10)
> > > I am getting several hundred failures per day. 
> >
> > You might want to consider reinstalling the perl executable, it might have
> > become corrupted somehow. That should at least be a troubleshooting step at
> > some point.
> > I'm not familiar with BSD package management; is there an integrity checker for
> > installed packages?
> 
> A new port for perl-5.8.9_3 was available and installed yesterday. The tar.bz2
> files are checked with MD5 & SHA256 checksums. I also ran 'make test'.
> 
> Failed 3 tests out of 1111, 99.73% okay.
>         ../lib/Net/Ping/t/450_service.t
>         ../lib/Net/Ping/t/510_ping_udp.t
>         op/alarm.t

...I was referring to checking the integrity of the installed executables, but installing a fresh update should address the possibility of file corruption there.

Given comment 17 you might want to check the integrity of your installed Berkeley DB libraries, too.

A corrupt AWL database? But that doesn't excuse SA crashing, SA should deal gracefully with it if that's the root cause. You might want to play with db_verify to see if it's a contributing factor - you'll need to temporarily shut down spamd to check the AWL database.

How often do you restart spamd? Have you restarted it recently (e.g. for the perl upgrade)?
Comment 23 Joel 2009-06-10 15:13:46 UTC
(In reply to comment #17)
> AWL, DB_File backend. The last few lines:
> spamd[1189]: auto-whitelist: tie-ing to DB file of type DB_File R/W
>  in /home/qscand/.spamassassin/auto-whitelist
> spamd[1189]: auto-whitelist: AWL active, pre-score: 14.945,
>  autolearn score: 14.945, mean: undef, IP: 79.177.123.207
> spamd[1189]: auto-whitelist: add_score: new count: 1, new totscore: 14.945 
> spamd[983]: prefork: child 1189: just exited 
> spamd[983]: prefork: child 1189: entering state 4 
> So what's missing is pretty much this (local run):
> [15543] dbg: auto-whitelist: DB addr list: untie-ing and unlocking
> [15543] dbg: auto-whitelist: DB addr list: file locked, breaking lock
> [15543] dbg: auto-whitelist: post auto-whitelist score: 15.72
> That's M::SA::Plugin::AWL and M::SA::DBBasedAddrList.
> Joel, can you add some fine grained debugging lines to Plugin/AWL.pm, function
> check_from_in_auto_whitelist(), please? Specifically between the call of
>  $whitelist->add_score($awlpoints);
> and the finish() a couple lines down. It might be necessary to recurse into the
> functions called from DBBasedAddrList.pm also.
> Interesting stuff to dump to dbg are variable values like $delta and the result
> of the sprintf call, etc. The added debugging should show how far exactly we
> come before the crash happens.

OK, the offending code appears to be :

      # Update the AWL *before* adding the new score, otherwise
      # early high-scoring messages are reinforced compared to
      # later ones.  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=159704
      if (!$pms->{disable_auto_learning}) {
        $whitelist->add_score($awlpoints);
      }

This appears to fail when $meanscore is undefined.

My logs:

Wed Jun 10 15:25:44 2009 [1187] dbg: auto-whitelist: tie-ing to DB file of type DB_File R/W in /home/qscand/.spamassassin/auto-whitelist
Wed Jun 10 15:25:44 2009 [1187] dbg: auto-whitelist: db-based amount917@tahoestores.com|ip=89.229 scores 0/0
Wed Jun 10 15:25:44 2009 [1187] dbg: auto-whitelist: db-based amount917@tahoestores.com|ip=none scores 0/0
Wed Jun 10 15:25:44 2009 [1187] dbg: auto-whitelist: AWL active, pre-score: 19.106, autolearn score: 19.106, mean: undef, IP: 89.229.181.196
Wed Jun 10 15:25:44 2009 [1187] dbg: auto-whitelist: (Gossamer) meanscore undef
Wed Jun 10 15:25:44 2009 [1187] dbg: auto-whitelist: add_score: new count: 1, new totscore: 19.106
Wed Jun 10 15:25:44 2009 [974] info: spamd: handled cleanup of child pid 1187 due to SIGCHLD
Comment 24 Karsten Bräckelmann 2009-06-10 15:59:27 UTC
(In reply to comment #23)
>       if (!$pms->{disable_auto_learning}) {
>         $whitelist->add_score($awlpoints);
>       }
> 
> This appears to fail when $meanscore is undefined.

I don't think so.  $meanscore is undef in my debugging run, too. It merely means that address / originating net-block has not been seen before. Which likely applies to most of your spam...

Moreover, it's not being used in add_score().


> Wed Jun 10 15:25:44 2009 [1187] dbg: auto-whitelist: (Gossamer) meanscore undef

What exactly did you log here?

> Wed Jun 10 15:25:44 2009 [1187] dbg: auto-whitelist: add_score:
>  new count: 1, new totscore: 19.106

This is M::SA::DBBasedAddrList::add_score(). As I mentioned before, it might be necessary to recurse into those functions adding yet more fine grained dbg statements.

It is necessary. ;)  Also, did you make sure SA did *not* go beyond that block, returning from the add_score() call?

Please add some visible hint to your custom dbg to easily distinguish them.


Hmm, just in case this does fail only for not previously seen AWL entries, how large is your AWL database?
  /home/qscand/.spamassassin/auto-whitelist
Comment 25 Karsten Bräckelmann 2009-06-10 16:06:19 UTC
> This appears to fail when $meanscore is undefined.

If you want to test that suspicion this should do. Similarly you can inject fake messages to spamd, too.

  echo From: foo@6127.bug | spamassassin -D auto-whitelist  2>&1 | less
Comment 26 Karsten Bräckelmann 2009-06-10 16:19:23 UTC
FWIW and to rule out the undef IP in my previous example, you can easily fake the submitting host IP (and thus AWL net-block used). Based on your last real-live debugging output. First, clear the AWL again.

 spamassassin --remove-addr-from-whitelist=foo@6127.bug

Then run with added Received header. Mind the line wraps, single spaces only. BTW, listed quite nicely here in DNSBLs. ;)

 echo -e 'From: foo@6127.bug\nReceived: from 6127.bug (6127.bug [89.229.181.196])
  by localhost\n\n' | spamassassin -D auto-whitelist  2>&1 | less
Comment 27 Joel 2009-06-10 16:48:35 UTC
(In reply to comment #24)
> (In reply to comment #23)
> >       if (!$pms->{disable_auto_learning}) {
> >         $whitelist->add_score($awlpoints);
> >       }
> > 
> > This appears to fail when $meanscore is undefined.
> I don't think so.  $meanscore is undef in my debugging run, too. It merely
> means that address / originating net-block has not been seen before. Which
> likely applies to most of your spam...
> Moreover, it's not being used in add_score().

Agreed. I merely noted my observation that all observed failures were preceded by the fact that $meanscore was undefined. 

> > Wed Jun 10 15:25:44 2009 [1187] dbg: auto-whitelist: (Gossamer) meanscore undef
> What exactly did you log here?

See code below.

> > Wed Jun 10 15:25:44 2009 [1187] dbg: auto-whitelist: add_score:
> >  new count: 1, new totscore: 19.106
> This is M::SA::DBBasedAddrList::add_score(). As I mentioned before, it might be
> necessary to recurse into those functions adding yet more fine grained dbg
> statements.
> It is necessary. ;)  

This log entry is the last one generated prior to the crash, I have yet to work my way into add_score(). I didn't know how fine grained you needed me to go.

> Also, did you make sure SA did *not* go beyond that block,
> returning from the add_score() call?

I am positive that we did not return from add_score(). Otherwise I would have logged another message:

      if (defined ($meanscore)) {
        $delta = ($meanscore - $awlpoints) * $pms->{main}->{conf}->{auto_whitelist_factor};

        $pms->set_tag('AWL', sprintf("%2.1f",$delta));
        $pms->set_tag('AWLMEAN', sprintf("%2.1f", $meanscore));
        $pms->set_tag('AWLCOUNT', sprintf("%2.1f", $whitelist->count()));
        $pms->set_tag('AWLPRESCORE', sprintf("%2.1f", $pms->{score}));
      }
      else {
          dbg("auto-whitelist: (Gossamer) meanscore undef");
      }

      # Update the AWL *before* adding the new score, otherwise
      # early high-scoring messages are reinforced compared to
      # later ones.  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=159704
      if (!$pms->{disable_auto_learning}) {
        $whitelist->add_score($awlpoints);
      }

      dbg("auto-whitelist: (Gossamer) bug 159704 fix completed - post auto-whitelist score: ".$pms->{score});

> Please add some visible hint to your custom dbg to easily distinguish them.

I have, my debug messages are tagged with (Gossamer).

> Hmm, just in case this does fail only for not previously seen AWL entries, how
> large is your AWL database?
>   /home/qscand/.spamassassin/auto-whitelist

tahoestores# ls -lat /home/qscand/.spamassassin/auto-whitelist
-rw-rw----  1 qscand  spamd  4398197493760 Jun 10 15:35 /home/qscand/.spamassassin/auto-whitelist

By the way, I disabled the AWL plugin and I have yet to see another segmentation fault.

Thanks for the help. I am not sure when I can return to the debug work, it was unscheduled and I am falling behind on the work I need to get done so I can pay my bills. Although I will admit this work is a bit more satisfying:) Perhaps in the AM.
Comment 28 Karsten Bräckelmann 2009-06-10 17:17:47 UTC
> tahoestores# ls -lat /home/qscand/.spamassassin/auto-whitelist
> -rw-rw----  1 qscand  spamd  4398197493760 Jun 10 15:35

Ye flippin' gods, that is FOUR TERA BYTE.

Actually a fraction larger.  What can your OS and FS handle? Well, what can DB_File handle? I guess we got the issue right there...
Comment 29 Joel 2009-06-10 17:30:29 UTC
(In reply to comment #28)
> > tahoestores# ls -lat /home/qscand/.spamassassin/auto-whitelist
> > -rw-rw----  1 qscand  spamd  4398197493760 Jun 10 15:35
> Ye flippin' gods, that is FOUR TERA BYTE.
> Actually a fraction larger.  What can your OS and FS handle? Well, what can
> DB_File handle? I guess we got the issue right there...

Unlikely. My disk is only 250Gb. When I run du -h the size is only 687M.

Must be an issue with ls.

I did read that you had a problem with larger files.

http://rivviepop.wordpress.com/2006/09/09/spamassassin-bayes-and-berkeley-db/
Comment 30 John Hardin 2009-06-10 17:37:46 UTC
(In reply to comment #29)
> (In reply to comment #28)
> > > tahoestores# ls -lat /home/qscand/.spamassassin/auto-whitelist
> > > -rw-rw----  1 qscand  spamd  4398197493760 Jun 10 15:35
> > Ye flippin' gods, that is FOUR TERA BYTE.
> > Actually a fraction larger.  What can your OS and FS handle? Well, what can
> > DB_File handle? I guess we got the issue right there...
> 
> Unlikely. My disk is only 250Gb. When I run du -h the size is only 687M.
> 
> Must be an issue with ls.

Or a sparse file. Or file corruption of some sort. Now that you have AWL disabled, run db_verify over it and see what happens...
Comment 31 Karsten Bräckelmann 2009-06-10 17:46:24 UTC
(In reply to comment #27)
> This log entry is the last one generated prior to the crash, I have yet to
> work my way into add_score(). I didn't know how fine grained you needed me
> to go.

The purpose is, to spot the last command executing successful. And then looking at the following more closely. ;)

> I am positive that we did not return from add_score(). Otherwise I would have
> logged another message:

>       if (!$pms->{disable_auto_learning}) {
>         $whitelist->add_score($awlpoints);
>       }
> 
>       dbg("auto-whitelist: (Gossamer) bug 159704 fix completed - post
> auto-whitelist score: ".$pms->{score});

Right, that's not in the logs, though the remaining code in DBBasedAddrList.pm add_score() doesn't seem overly complicated or prone to crashes. :/

> > Please add some visible hint to your custom dbg to easily distinguish them.
> 
> I have, my debug messages are tagged with (Gossamer).

Err, right. Did sound familiar, so I guessed it could have come from some var... Which gossamer, gossamer-threads.com?

> tahoestores# ls -lat /home/qscand/.spamassassin/auto-whitelist
> -rw-rw----  1 qscand  spamd  4398197493760 Jun 10 15:35
> /home/qscand/.spamassassin/auto-whitelist

Still 4 Tera Byte -- though I found some hint this might not be it. Still possibly depending on OS, arch, kernel options...

 "GDBM and DB_File are both essentially unlimited. (On 32-bit platforms your
  main limit is the fact that pointers are only able to address 2 GB of data.
  If you have large file support, I don't know what GDBM can do, but DB_File
  is OK out to hundreds of terabytes of data, disk permitting.)"


> By the way, I disabled the AWL plugin and I have yet to see another
> segmentation fault.

That was kind of expected. ;)

Switching the AWL backend to SQL should work, too. And IMHO probably should be strongly preferred, though I didn't use it myself yet.

> Thanks for the help. I am not sure when I can return to the debug work, it was
> unscheduled and I am falling behind on the work I need to get done so I can
> pay my bills. Although I will admit this work is a bit more satisfying:)
> Perhaps in the AM.

Yeah, kind of fun, in a strange way. ;)  As for me, it is getting really late here in DE, and won't be available tomorrow as much as I was today. But still, I'll check back, and there are other knowledgeable devs to pick up from here.
Comment 32 Karsten Bräckelmann 2009-06-10 17:56:14 UTC
> > Ye flippin' gods, that is FOUR TERA BYTE.

> Unlikely. My disk is only 250Gb. When I run du -h the size is only 687M.
> 
> Must be an issue with ls.

Odd. Anyway, that's still quite large, and I'd recommend SQL with a size like that. Or at the very least, occasionally pruning single-time spam sources. There's a handy script for that out there.

> I did read that you had a problem with larger files.
> http://rivviepop.wordpress.com/2006/09/09/spamassassin-bayes-and-berkeley-db/

That's Bayes (not AWL) and *expiry* runs timing out. Well known issue, and typically  the proper solution is to switch to the (Bayes) SQL backend.
Comment 33 Joel 2009-06-10 19:25:25 UTC
(In reply to comment #30)
> (In reply to comment #29)
> > (In reply to comment #28)
> > > > tahoestores# ls -lat /home/qscand/.spamassassin/auto-whitelist
> > > > -rw-rw----  1 qscand  spamd  4398197493760 Jun 10 15:35
> > > Ye flippin' gods, that is FOUR TERA BYTE.
> > > Actually a fraction larger.  What can your OS and FS handle? Well, what can
> > > DB_File handle? I guess we got the issue right there...
> > 
> > Unlikely. My disk is only 250Gb. When I run du -h the size is only 687M.
> > 
> > Must be an issue with ls.
> Or a sparse file. Or file corruption of some sort. Now that you have AWL
> disabled, run db_verify over it and see what happens...

Without knowing the structure of GDBM and DB_File (s) I would also guess a sparse file.

If a db_file is somehow corrupt, would it still open? I tracked down another error message for the failure to open one of the db_file (s) and googled that corruption was the probable culprit. I believe I ran sa-learn --sync & sa-learn --force-expire on the file and the problem went away.

I don't believe I have db_verify installed. Do you know of docs that cover the maintenance of db_file (s). What do I need add to cron to prevent future problems?
Comment 34 Joel 2009-06-10 19:36:17 UTC
(In reply to comment #31)
> (In reply to comment #27)
> > > Please add some visible hint to your custom dbg to easily distinguish them.
> > 
> > I have, my debug messages are tagged with (Gossamer).
> Err, right. Did sound familiar, so I guessed it could have come from some
> var... Which gossamer, gossamer-threads.com?

Actually, Gossamer Computer Services LLC, a one man shop located up in the mountains in a place called Lake Tahoe, CA. I used to farm myself out for C++ and Perl programming gigs, until the dot-com crash. Now I am providing Internet consulting services to small local businesses.

I'm having quite the time digging into your code:) I appreciate your help.
Comment 35 Joel 2009-06-10 19:57:26 UTC
(In reply to comment #32)
> > > Ye flippin' gods, that is FOUR TERA BYTE.
> > Unlikely. My disk is only 250Gb. When I run du -h the size is only 687M.
> > 
> > Must be an issue with ls.
> Odd. Anyway, that's still quite large, and I'd recommend SQL with a size like
> that. Or at the very least, occasionally pruning single-time spam sources.

I am running MySQL for most of my apps. I think the only instances of BDB I have are for SA & Subversion. Looks like SA is using version 3 and I have Subversion using version 42.

> There's a handy script for that out there.
> > I did read that you had a problem with larger files.
> > http://rivviepop.wordpress.com/2006/09/09/spamassassin-bayes-and-berkeley-db/
> That's Bayes (not AWL) and *expiry* runs timing out. Well known issue, and
> typically  the proper solution is to switch to the (Bayes) SQL backend.

I am running MySQL for most of my apps. I think the only instances of BDB I have are for SA & Subversion.

BDB appears to unlimited is size (comment #31) and it must be pretty fleet on its feet since SA is by far the largest consumer of transactions I have on my server.

Would MySQL still be a better choice?

Are there docs to modify SA to use MySQL?

Are there docs outlining any maintenance issues?

Thanks
Comment 36 John Hardin 2009-06-11 09:09:39 UTC
(In reply to comment #33)
>
> I don't believe I have db_verify installed.

db_verify should have been installed with the BDB libraries, along with other BDB utilities like db_dump and db_load.

> Do you know of docs that cover the maintenance of db_file(s).
> What do I need add to cron to prevent future problems?

I have no idea, sorry. I've never had to deal with a corrupt BDB file myself. I've been researching this from scratch to offer you advice here... :)

It's also completely possible that the file _isn't_ corrupt. I'm primarily suggesting this as a data point for troubleshooting the segfault and making mitigation suggestions on the assumption that it is corrupt, while Karsten investigates the SA code itself.

If the file _is_ corrupt, and repairing it (I don't know how, perhaps db_dump followed by db_load) makes the segfault go away, then that suggests the BDB library has a bug. The library shouldn't segfault on corrupt data, it should gracefully return a failure code.

So: run db_verify to check whether the AWL database file is corrupted; if it is, check whether there's an update for your BDB libraries available, and if so, see whether installing that update changes the behavior of the system. If it still crashes, report a bug to the BDB devs. Keep the original file in case they need a repro.

Still, it would be best if SA responds gracefully to DB-File blowing up and logs meaningful error messages while continuing to process the message. Karsten: We know AWL and Bayes will get big, perhaps it would be prudent to wrap their DB_File calls in signal trapping and recover gracefully from major failures?

As it is, I suspect you have four possibilities regardless of any fixes in SA:

(1) Stop using AWL completely;

(2) Attempt to recover the bulk of your AWL database using db_dump and db_load or the AWL maintenance utility script mentioned in comment 32, and resume using AWL with your current config;

(3) Completely wipe your AWL database file, and resume using AWL with your current config;

(4) Reconfigure SA to use MySQL AWL, which will probably result in losing your current AWL data (there may be a data migration procedure, I don't know).

If you continue using AWL you probably should add a periodic process to manage the database size using the maintenance script or corresponding MySQL queries. A 4.4TB AWL database is ridiculous, even if it is a sparse file.
Comment 37 Karsten Bräckelmann 2009-06-11 16:26:27 UTC
(In reply to comment #36)
> Still, it would be best if SA responds gracefully to DB-File blowing up and
> logs meaningful error messages while continuing to process the message.

Without really thinking about this -- I'm not sure if there is any possibility to do that. The perl process (or the lib) itself made an invalid memory reference. Not sure how to catch this if at all, and to continue skipping DB access.

Hmm, now that I said this, there's the ls vs du discrepancy. An invalid memory reference, and a file that's reported to be 4 TB or 600-odd MB... Maybe the underlying issue is much more fundamental to the system. Need to think about this.

Just don't have the time right now. And didn't even read all recent comments carefully. ;)


> Karsten: We know AWL and Bayes will get big, perhaps it would be prudent to
> wrap their DB_File calls in signal trapping and recover gracefully from major
> failures?

Bayes gets roughly about as big as it is set to in the config. It only keeps growing beyond borders, if the expiry run times out. In that case, the admin's choice of BDB backend was't the best...

Growing Bayes DBs are an entirely unrelated topic. Unlike AWL, Bayes generally is bound by DB size, and does have an automatic expiry mechanism.
Comment 38 Mark Martinec 2009-09-03 03:52:47 UTC
(In reply to comment #19)
> > I uploaded a zip file from debug.log detailing all log entries for
> > a pid that crashed.
> 
> Thanks. It's a pitty that spamd does not log the exit status of
> a failing process (the $? after a waitpid call).

  Bug 6150, Bug 6127, Bug 5981, Bug 5950, Bug 6191: let spamd
  log/report a child process exit status or aborting condition
  in an informative way
Sending        lib/Mail/SpamAssassin/Util.pm
Sending        spamd/spamd.raw
Committed revision 810883.
Comment 39 Kevin A. McGrail 2013-01-20 17:24:58 UTC
Believe this is closed in trunk but not updated correctly.