Bug 6698

Summary: New DCC.pm improves performance, is more resilient.
Product: Spamassassin Reporter: Michael Scheidell <michael>
Component: PluginsAssignee: SpamAssassin Developer Mailing List <dev>
Status: RESOLVED FIXED    
Severity: blocker CC: apache, kmcgrail, sidney, vjs
Priority: P2    
Version: 3.3.2   
Target Milestone: 3.4.0   
Hardware: All   
OS: All   
Whiteboard:
Bug Depends on:    
Bug Blocks: 6699    
Attachments: /tmp/DCCpatch.txt and DCC.pm
the committed patch (i.e.: svn diff -c1201384)
use -Q instead of -Qmany on incoming mail with existing X-DCC headers

Description Michael Scheidell 2011-11-09 21:47:24 UTC
Created attachment 5010 [details]
/tmp/DCCpatch.txt and DCC.pm

Vernon Schryver <vjs@rhyolite.com>, author of DCC has worked on a new DCC.pm plugin

From his email:



-----Original Message-----
From: Vernon Schryver [mailto:vjs@rhyolite.com] 
Sent: Wednesday, November 09, 2011 12:52 PM
To: Michael Scheidell
Subject: RE: DCC.pm

> Let me try to present them again, since 3.3.3? 3.4? is on the horizon.
>
> Is there a bugzilla open for it yet?  If not, I will open one.

I don't know.

> What are the changes? All upward compatible? Or will something change?

I believe it is entirely upward compatible.


Besides a lot of cleaning of the original ad hoc and now rather crufty perl source,  it has improved error messages, better searching for the dccproc and cdcc programs, and speed improvements from calling them less frequently and using dccsight instead of re-scanning the message.

It also has dcc_learn_score.

That mechanism reports spam detected by SpamAssassin with a score of 'many' to DCC.

A DCC installation has started using it on its incoming mail of about 750K messages/day.  It has had little effect on DCC installations that set DCC thresholds to reasonable values.  Judging from the global graphs, it has noticable effects on the majority that use the old DCC.pm default of 9999999 for 'many'.
The effect is to promote the count for a spam checksum from a substantial number to 'many'

attached is patch (so you can see what changed) and full, DCC.pm.
Comment 1 Mark Martinec 2011-11-11 14:34:16 UTC
This is quite a hefty change. I updated it for the 3.4 (added the few changes
that accumulated between 3.3.1 and the current trunk, plus few minor details),
and it seems to run fine at our site. I haven't tested all possible scenarios
though. The 3.4.0 is probably the best time to accept such a change (if ever)
from the author himself, so I'm tentatively setting the target to 3.4.0.

If others agree, I'd like to commit it to trunk, making it easier for
everybody to test. Are there any CLA concerns?
Comment 2 Kevin A. McGrail 2011-11-11 17:45:07 UTC
(In reply to comment #1)
> This is quite a hefty change. I updated it for the 3.4 (added the few changes
> that accumulated between 3.3.1 and the current trunk, plus few minor details),
> and it seems to run fine at our site. I haven't tested all possible scenarios
> though. The 3.4.0 is probably the best time to accept such a change (if ever)
> from the author himself, so I'm tentatively setting the target to 3.4.0.
> 
> If others agree, I'd like to commit it to trunk, making it easier for
> everybody to test. Are there any CLA concerns?

A) I think this is perfect for 3.4.0
B) There *are* CLA concerns in my opinion.  I will email Vernon. 

Regards,
KAM
Comment 3 Sidney Markowitz 2011-11-12 01:10:51 UTC
I respect Vernon's unhappiness with the documentation's statement that DCC is not "open source" but I disagree with his replacement wording for why the plugin is not enabled by default. I propose the following instead. I don't want to upload an entirely new patch. I don't want to make the change in the existing source code for DCC.pm because that would just cause difficulties when it becomes time to apply this extensive patch.

The portion of his patch that I disagree with:

-Note that DCC is disabled by default in C<init.pre> because it is not
-open source.  See the DCC license for more details.
+See http://www.dcc-servers.net/dcc/ for more information about DCC.

-See http://www.rhyolite.com/anti-spam/dcc/ for more information about
-DCC.
+DCC is disabled by default in C<confdir/v310.pre> because it is free for
+only some but not all commercial users.  See the DCC license for the free
+version at http://www.dcc-servers.net/dcc/LICENSE.

The wording that I would put there:

Note that DCC is disabled by default in C<init.pre> because its license
does not meet the Apache Software Foundation's licensing criteria for
third-party software.

See the DCC license at http://www.dcc-servers.net/dcc/LICENSE
for more details and the ASF criteria at
http://www.apache.org/legal/resolved.html#criteria
Comment 4 Sidney Markowitz 2011-11-12 01:44:54 UTC
I missed Vernon's change from init.pre to v310.pre, which is more correct. As the comments in the pre files say, a line for a plugin goes in the pre file for the version in which it was first released.

This also means that the wording in the comment in v310.pre should be changed to match whatever ends up in DCC.pm
Comment 5 Sidney Markowitz 2011-11-12 05:55:02 UTC
I've been convinced by some off-list emails that it is not be true to say that just because the optional DCC.pm plugin requires the dcc software to work after it is enabled, that enabling the plugin in SpamAssassin without distributing the dcc software would violate the ASF third-party software criteria.

The ASF policy that the criteria relate to says

http://www.apache.org/legal/resolved.html#optional

"Can Apache projects rely on components whose licensing affects the Apache product?

Apache projects cannot distribute any such components. However, if the component is only needed for optional features, a project can provide the user with instructions on how to obtain and install the non-included work. Optional means that the component is not required for standard use of the product or for the product to achieve a desirable level of quality. The question to ask yourself in this situation is:

    'Will the majority of users want to use my product without adding the optional components?'"

That says we cannot distribute the dcc software. We can distribute DCC.pm and make it an optional feature.

However, I agree with not enabling the plugin by default. It is just plain ugly to have it enabled when it cannot work because required software is not installed.

Here is another proposed wording:

Note that DCC is disabled by default in C<v310.pre> because its use requires
software that is not distributed with SpamAssassin and that has license
restrictions for certain commercial uses.
See the DCC license at http://www.dcc-servers.net/dcc/LICENSE for details.
Comment 6 Sidney Markowitz 2011-11-12 23:24:12 UTC
Mark, a CLA has been registered for Vernon, so the patch is ready for you to
commit. Vernon agreed, in an email, to my wording in comment #5, so please use
that.

If you want to keep in a line about "for more information about DCC" in
addition to the link to the license for DCC, make sure that you use the URL
http://www.dcc-servers.net/dcc/

I'm agnostic about whether you really need to include that link, since it will
be obvious to anyone given the license at
http://www.dcc-servers.net/dcc/LICENSE that they can find more information on
dcc in the parent URL. On the other hand, including that link doesn't hurt, so
I would agree with either way.
Comment 7 Mark Martinec 2011-11-13 01:37:36 UTC
> Mark, a CLA has been registered for Vernon, so the patch is ready for you to
> commit.

Great, thanks!

> Vernon agreed, in an email, to my wording in comment #5, so please use that.

Ok, done.

> If you want to keep in a line about "for more information about DCC" in
> addition to the link to the license for DCC, make sure that you use the URL
>   http://www.dcc-servers.net/dcc/
> I'm agnostic about whether you really need to include that link, since it will
> be obvious to anyone given the license at
> http://www.dcc-servers.net/dcc/LICENSE that they can find more information on
> dcc in the parent URL. On the other hand, including that link doesn't hurt, so
> I would agree with either way.

I left the ref to http://www.dcc-servers.net/dcc/ in,
although I wouldn't mind either way.


trunk:
  Bug 6698: New DCC.pm improves performance, is more resilient
  Sending lib/Mail/SpamAssassin/Plugin/DCC.pm
Committed revision 1201384.


Please don't close just yet, I'd still like to make a closer look,
although it appears alright. Additional testing and code review is
welcome if anyone feels like it.
Comment 8 Mark Martinec 2011-11-13 01:45:29 UTC
Created attachment 5011 [details]
the committed patch (i.e.: svn diff -c1201384)

The diff that was committed - for ease of inspection if anyone
cares.  With minor changes from the original proposed patch (updates
for 3.3.1 -> 3.4, some paragraph alignment, and some trivialities).
Comment 9 Vernon Schryver 2011-11-13 15:46:14 UTC
There is a minor issue with the committed change.  Line 523 is now
  $pgmpath = '/var/dcc/' . $pgm;
I think it should be the value that can see in 
http://www.dcc-servers.net/dcc/dcc-tree/misc/DCC.pm 
  $pgmpath = '/usr/local/dcc/' . $pgm;
Or the /var/dcc/ line could be kept and these 2 lines added 
  $pgmpath = '/usr/local/dcc/' . $pgm;
  return $pgmpath if -x $pgmpath;
 
The intent in using /usr/local/dcc was to look in a common 
choice for the DCC `./configure --bindir=DIR` value.
The default value is /usr/local/bin

I apologize for not noticing this issue before the change 
was committed.


A separate issue is what should be done with the copy of DCC.pm
that has been in versions of DCC for the last 18 months.
It's always asking for trouble to have 2 copies of a value,
not to mention code.  I could remove the copy in the DCC source
after the SpamAssassin version is released.  When might that be?
I expect to make a DCC release in the next few months.
Deleting the copy of DCC.pm in the DCC source would inconvenience
installations that are looking in the DCC tarball for DCC.pm.  
Some of them are using the fact that the DCC version of DCC.pm
is actually generated by the DCC ./configure script.  See 
http://www.dcc-servers.net/dcc/dcc-tree/misc/DCC.pm.in
Do you have any suggestions?
Comment 10 Mark Martinec 2011-11-14 19:27:44 UTC
> There is a minor issue with the committed change.  Line 523 is now
>   $pgmpath = '/var/dcc/' . $pgm;
> I think it should be the value that can see in 
> http://www.dcc-servers.net/dcc/dcc-tree/misc/DCC.pm 
>   $pgmpath = '/usr/local/dcc/' . $pgm;
> Or the /var/dcc/ line could be kept and these 2 lines added 
>   $pgmpath = '/usr/local/dcc/' . $pgm;
>   return $pgmpath if -x $pgmpath;
>
> The intent in using /usr/local/dcc was to look in a common 
> choice for the DCC `./configure --bindir=DIR` value.
> The default value is /usr/local/bin

I left in both, added some debugging, and streamlined a bit.
Please see what came out  (svn diff -c1201835) :

trunk:
  Bug 6698#9: New DCC.pm improves performance, is more resilient'       
  Sending lib/Mail/SpamAssassin/Plugin/DCC.pm
Committed revision 1201835.

> A separate issue is what should be done with the copy of DCC.pm
> that has been in versions of DCC for the last 18 months.
> It's always asking for trouble to have 2 copies of a value,

Agree.  As the DCC.pm plugin is only usable with SpamAssassin,
and has been distributed with it since I remember, I see no point
of having another copy of this plugin distributed with dcc.

> not to mention code.  I could remove the copy in the DCC source
> after the SpamAssassin version is released.  When might that be?

Soon ;)

> I expect to make a DCC release in the next few months.
> Deleting the copy of DCC.pm in the DCC source would inconvenience
> installations that are looking in the DCC tarball for DCC.pm.  
> Some of them are using the fact that the DCC version of DCC.pm
> is actually generated by the DCC ./configure script.  See 
> http://www.dcc-servers.net/dcc/dcc-tree/misc/DCC.pm.in
> Do you have any suggestions?

I think it is mostly up to you to decide. I'd ditch the DCC.pm.in .
Comment 11 Vernon Schryver 2011-11-14 20:48:35 UTC
> I left in both, added some debugging, and streamlined a bit.
> Please see what came out  (svn diff -c1201835) :

Because 
     $conf->{dcc_home}
can be undefined, I think you really need something like
     my $home = $conf->{dcc_home} || '/var/dcc';
and then using $home instead of $conf->{dcc_home} in line 516.
If the user has not set $conf->{dcc_home}, won't you at best
get undesired results from such as 
     ($conf->{dcc_home}.'/bin'
There is a similar prlbme with the use of $conf->{dcc_libexec}
in line 516 that will cause surprises or perl interpreter
croaking in line 518 if $conf->{dcc_libexec} is not defined by
the user.

I don't understand how some of the changes to regular 
expressions or replacing
     if ($x) {
with
     if (defined $x && $x ne '') {
qualifies as streamlining, but there's no accounting for
tastes and it's now your code.
Comment 12 Vernon Schryver 2011-11-14 20:52:59 UTC
Never mind what I just wrote about $conf->{dcc_home}
and $conf->{dcc_libexec} possibly being undefined.
I now think they're always set to something in ck_dir().
Comment 13 Kevin A. McGrail 2011-11-14 21:31:16 UTC
Making this a blocker for 3.4.0 as it contains code that needs a CLA which is in progress.
Comment 14 Sidney Markowitz 2011-11-14 21:39:45 UTC
(in reply to comment #13)

Kevin, we have a CLA for Vernon. An acknowledgment of receipt was sent to the PMC, and Vernon's name appears on the committers page in the list of non-committers who have sent a CLA. Is there any other CLA that this patch depends on?
Comment 15 Kevin A. McGrail 2011-11-14 21:46:55 UTC
(In reply to comment #14)
> (in reply to comment #13)
> 
> Kevin, we have a CLA for Vernon. An acknowledgment of receipt was sent to the
> PMC, and Vernon's name appears on the committers page in the list of
> non-committers who have sent a CLA. Is there any other CLA that this patch
> depends on?

You are correct but there are two CLAs in the works and I believe we are still waiting on the Corporate CLA for Rhyolite.  I corresponded w/ Craig Russel at 11:51AM EST today (about 5 hours ago) and he was waiting on a corrected version as it was on the wrong form.  I emailed Vernon to ping him as well.

If you review emails to PMC you'll note Subject 1: Re: Individual Contributor License Agreement from Vernon Schryver vs Subject 2: Re: Individual Contributor License Agreement from Rhyolite Software

regards,
KAM
Comment 16 Vernon Schryver 2011-11-14 22:14:00 UTC
Please also forget my comment about !$x versus
defined $x && $x ne ''
I now realize that the intent was probably to
allow files and directories named '0'.
Comment 17 Kevin A. McGrail 2011-11-14 22:19:05 UTC
(In reply to comment #16)
> Please also forget my comment about !$x versus
> defined $x && $x ne ''
> I now realize that the intent was probably to
> allow files and directories named '0'.

That's a good edge case actually but you can also have a variable defined but empty.  So this is a test to see is it defined and is it empty.  Unfortunately, with parameter sanitization, this case comes up more than it should.
Comment 18 Sidney Markowitz 2011-11-14 22:30:48 UTC
(in reply to comment #15)

Ah, I interpreted the two emails as that he first sent an Individual CLA form filled out for himself with his DBA business name, then corrected it by sending an ICLA form filled out as an individual. You are saying that we also need a corporate CLA from Rhyolite Software.
Comment 19 Kevin A. McGrail 2011-11-14 22:32:07 UTC
(In reply to comment #18)
> (in reply to comment #15)
> 
> Ah, I interpreted the two emails as that he first sent an Individual CLA form
> filled out for himself with his DBA business name, then corrected it by sending
> an ICLA form filled out as an individual. You are saying that we also need a
> corporate CLA from Rhyolite Software.

Yes.  And that second CLA is winging it's way through the process now.
Comment 20 Sidney Markowitz 2011-11-14 22:42:36 UTC
(in reply to comment #19)

And it has arrived. All CLA issues are now completed.
Comment 21 Kevin A. McGrail 2011-11-14 23:13:15 UTC
(In reply to comment #20)
> (in reply to comment #19)
> 
> And it has arrived. All CLA issues are now completed.

I've updated Vernon's CLA status in bugzilla.  Closing this bug as resolved/fixed but feel free to reopen if I missed more things in discussion.
Comment 22 Mark Martinec 2011-11-15 00:57:10 UTC
> I think you really need something like
>      my $home = $conf->{dcc_home} || '/var/dcc';
> and then using $home instead of $conf->{dcc_home} in line 516.

Sure, if you want. I removed the "my $home=..." from your code
as it was not used anywhere.

> I don't understand how some of the changes to regular 
> expressions or replacing
>     if ($x) {
> with
>      if (defined $x && $x ne '') {
> qualifies as streamlining

I was referring to 5 unrolled copies of:
   $pgmpath = ...
   return $pgmpath if (-x $pgmpath);
being replaced by a foreach loop.


> Never mind what I just wrote about $conf->{dcc_home}
> and $conf->{dcc_libexec} possibly being undefined.
> I now think they're always set to something in ck_dir().

I didn't check this. If we are not sure we can put back the test
for defined() back. I only checked that $conf->{dcc_home} already
is untainted by the config parsing code, so the call to
untaint_file_path was redundant.


> Please also forget my comment about !$x versus
> defined $x && $x ne ''
> I now realize that the intent was probably to
> allow files and directories named '0'.

Exactly! This is one of my pet peeves! Around me it is not uncommon
that some throwaway/test program is kept on a file named "0".
Surprisingly often this reveals a bug in some program.
In the same vein we had a bug in SpamAssassin, where a "Subject: 0"
was marked as a missing Subject header field!

This is described under Bug 5965:
- header field with a value of "0" (zero) invisible;
- do not treat user data as perl booleans (a string "0" is a false);
- continue work on avoiding user data to be tested as perl booleans,
  instead test for defined or for an empty string as appropriate;
Comment 23 Vernon Schryver 2011-12-26 17:19:16 UTC
Created attachment 5024 [details]
use -Q instead of -Qmany on incoming mail with existing X-DCC headers

This patch originally from Herbert J. Skuhra fixes a significant problem
seen with mail with pre-existing X-DCC headers.
Comment 24 Mark Martinec 2011-12-26 22:09:30 UTC
> Created attachment 5024 [details]
> use -Q instead of -Qmany on incoming mail with existing X-DCC headers
> This patch originally from Herbert J. Skuhra fixes a significant problem
> seen with mail with pre-existing X-DCC headers.

Thanks, committed:

Bug 6698 #23 - use -Q instead of -Qmany
on incoming mail with existing X-DCC headers
  Sending lib/Mail/SpamAssassin/Plugin/DCC.pm
Committed revision 1224826.
Comment 25 Henrik Krohns 2012-04-01 14:28:33 UTC
Fixed the annoying uninitialized dcc_libexec when cdcc binary is not in path. All the path search stuff here is too confusing to make a more "elegant" fix.

Also a pointless $dcc_libexec removed.

--- DCC.pm      (revision 1307734)
+++ DCC.pm      (working copy)
@@ -425,7 +425,6 @@

 sub find_dcc_home {
   my ($self) = @_;
-  my $dcc_libexec;

   # just once
   return if defined $self->{dcc_version};
@@ -515,6 +514,7 @@
   # try dcc_home/bin, dcc_libexec, and some desperate last attempts
   foreach my $dir ($conf->{dcc_home}.'/bin',  $conf->{dcc_libexec},
                    '/usr/local/bin', '/usr/local/dcc', '/var/dcc') {
+    next unless defined $dir;
     $pgmpath = $dir . '/' . $pgm;
     if (-x $pgmpath) {
       dbg("dcc: dcc_pgm_path, found %s in %s: %s", $pgm,$dir,$pgmpath);


Sending        DCC.pm
Transmitting file data .
Committed revision 1308124.
Comment 26 Vernon Schryver 2013-09-04 14:16:42 UTC
There is a bug in this version of DCC.pm that is fixed with this change:

--- DCC.pm.old  2013-09-04 14:15:32.000000000 +0000
+++ DCC.pm      2013-09-04 14:15:14.000000000 +0000
@@ -825,8 +825,10 @@
        $opts = $self->{dccifd_report_options}
       } else {
        $opts = $self->{dccifd_lookup_options};
-       # only query if there is an X-DCC header
-       $opts =~ s/grey-off/& query/ if defined $permsgstatus->{dcc_raw_x_dcc};
+       if defined $permsgstatus->{dcc_raw_x_dcc} {
+         # only query if there is an X-DCC header
+         $opts =~ s/grey-off/grey-off query/
+       }
       }
       $sock->print($opts)         or die "failed write options\n";
       $sock->print($client . "\n") or die "failed write SMTP client\n";
Comment 27 Mark Martinec 2013-09-04 18:27:34 UTC
(In reply to Vernon Schryver from comment #26)
> There is a bug in this version of DCC.pm that is fixed with this change:

Thanks for noticing and the fix! Applied.

trunk:
Bug 6698#26: DCC dcc_raw_x_dcc fixed
  Sending lib/Mail/SpamAssassin/Plugin/DCC.pm
Committed revision 1520091.