Bug 5511

Summary:	[review] Dns.pm: timeout zero while collecting DNS responses
Product:	Spamassassin	Reporter:	Mark Martinec <Mark.Martinec>
Component:	Libraries	Assignee:	SpamAssassin Developer Mailing List <dev>
Status:	RESOLVED FIXED
Severity:	normal
Priority:	P3
Version:	3.2.0
Target Milestone:	3.2.2
Hardware:	All
OS:	All
Whiteboard:	3985 can be committed, 3991 can be comitted
Bug Depends on:
Bug Blocks:	5585
Attachments:	the promised patch The additional patch, filling up my 'feature request'

Description Mark Martinec 2007-06-12 09:13:41 UTC

While investigating why certain type of spam sometimes does
not receive its deserved URIDNSBL hits and sometimes also
misses RBL lookup results, I noticed the following telltale
pattern:
  [42083] dbg: dns: success for 21 of 22 queries
  [42083] dbg: dns: timeout for after 0 seconds

The 'after 0 seconds' looked very suspicious, and digging behind
scenes reveals one bug, one formula calling for improvement,
and one enhancement request (and also sprouted a Bug 5509 report).

To understand the bug the following scenario can be considered,
affixing a typical timestamp (time from the start) with each item:

- 1 s shortly after start of processing, the first round of RBL
      and other DNS queries gets launched;
- 1..5 s processing continues with rules, razor, dcc, botnet, etc.,
      while DNS results begin to drop in;
- 5 s sub harvest_dnsbl_queries is called, it computes $deadline
      and enters its while loop, collecting responses;
      Note that $deadline is computed from the time the last DNS
      query was sent (at 1 s timestamp).
- 5 s Now here is the catch: the $self->{async}->complete_lookups(1)
      within the 'while' expression calls registered 'completed_callback',
      which (as in Plugin::URIDNSBL.pm) can call start_lookup() again,
      producing new queries, $self->{async}->get_last_start_lookup_time
      receives a new timestamp, but, alas, the $deadline is not recomputed!
- 6 s The second round of DNS queries times out instantaneously (when
      rbl_timeout is 6 or less), and they never get a chance of succeeding,
      even if responses arrive almost immediately.

The solution is to recompute $deadline after every call of 
{async}->complete_lookups.  I'll attach a patch that does it.


Now to the 'formula calling for improvement' part.

Mail::SpamAssassin::Conf man page describes the dynamic reduction of
rbl_timeout timeout, but promises the timeout never falls below 1:

  In addition, whenever the effective timeout is lowered due to addi-
  tional query results returning, the remaining queries are always
  given at least one more second before timing out, ...

That promise is broken on occasion, for example with this particular
type of spam, 22 dns queries are launched, and if I have rbl_timeout
at 5, the dynamic timeout falls to zero for collecting the last dns
reply (which did arrive, but is never reclaimed). And the last reply
is often the one from URIDNSBL checks, sometimes the only worthwhile
hit that such spam receives.

I think the formula needs to be revised: it should not reduce timeout
to near-zero when remaining queries fall to 1, but perhaps to some
fraction (say: one third) of the original rbl_timeout. And regardless,
the one second minimum must be kept.

I'm now using the following formula, which keeps this promise:

    my $dynamic = (int($self->{conf}->{rbl_timeout}
                      * (1 - 0.7*(($total - @left) / $total) ** 2) + 1)

by dropping time limit to no less than 0.3 of the original value,
offset by 0.5 (with additional 0.5 for rounding). I'll include this
in my match.


Now for the last part - a feature request.  I think that no attempt
has been made to collect already received DNS responses when
timeout is reached. Given an asynchronous nature of DNS lookups
in this module, I think it would be worthwhile to collect whatever
is still in the IP receive queue after a timeout.

Thanks for the patience of reading this far  :)

Comment 1 Mark Martinec 2007-06-12 09:15:00 UTC

Created attachment 3985 [details]
the promised patch

Comment 2 Justin Mason 2007-06-12 09:35:31 UTC

thanks Mark, will look at these

Comment 3 Mark Martinec 2007-06-12 15:55:51 UTC

Thoughts for the future:

From a broader perspective, it would probably be desirable to
launch the second round of DNS queries (by Plugin::URIDNSBL)
shortly after plain RBL lookups, i.e. before Razor, dcc, Botnet
evaluation starts, to give them more time to collect results
(or to avoid stretching time limit further).

Other plugins could benefit from being split into two parts,
e.g. Plugin::DKIM could launch a DNS query for a signer's
public key and policy after checking a mail header, but leave
signature verification for later, to give DNS queries time
to collect results.

Comment 4 Mark Martinec 2007-06-12 16:34:03 UTC

Copying my reply on SA user list, topic 'These are getting through SA...'.

Bill Landry,
> Mark, I patched Dns.pm but this didn't resolve the issue for me.
> You can test with the sample messages I posted to bugzilla:
  Bug 5506


Yes, it is the same problem as I describe in Bug 5511
but to fix it requires my 'feature request' to be implemented:

  Now for the last part - a feature request.  I think that no attempt
  has been made to collect already received DNS responses when
  timeout is reached. Given an asynchronous nature of DNS lookups
  in this module, I think it would be worthwhile to collect whatever
  is still in the IP receive queue after a timeout.


The problem with BOTNET is that it tries to do a reverse DNS
lookup on 66.17.235.109, which has broken DNS servers and
none are reachable, so it hangs in sub get_rdns query($ip,'PTR','IN')
for 24 seconds.

After BOTNET finally times out, the Dns.pm harvest_dnsbl_queries
tries to collect its RBL results, but abandons all attempts right away
because it sees that 24 seconds has passed by, and just declares
them timed out, despite the fact that DNS results are waiting
in TCP/IP received queue.

Comment 5 Justin Mason 2007-06-14 11:18:40 UTC

ok, that patch makes sense. applied to trunk:

: jm 176...; svn commit -m "bug 5511: recompute the DNS lookup deadline after
every call to {async}->complete_lookups(), to avoid timing out too early; also,
never allow the deadline to be recomputed to 0 in contradiction of the
documentation. thanks to Mark Martinec" lib/Mail/SpamAssassin/Dns.pm
Sending        lib/Mail/SpamAssassin/Dns.pm
Transmitting file data .
Committed revision 547344.

+1 to apply to 3.2.x...

I agree, the feature request makes sense, too. I'd apply that if someone coded
it up ;)

Comment 6 Mark Martinec 2007-06-14 12:10:15 UTC

> ok, that patch makes sense. applied to trunk

Thanks!

> I agree, the feature request makes sense, too.
> I'd apply that if someone coded it up ;)

I came up with something simple for starters, will check it out first
and report.

Comment 7 Theo Van Dinter 2007-06-14 12:50:53 UTC

just to keep it in mind -- if these major kind of bugs also affect 3.1, we
should backport the fix there as well.

Comment 8 Mark Martinec 2007-06-14 17:46:54 UTC

> I came up with something simple for starters, will check it out first
> and report.

Ok, here it comes, the attached patch applies on top of my previous one.

This one is quite simple: if harvest_dnsbl_queries() happens to be called
very late - when rbl_timeout has already expired - then extend the deadline
for an additional second so that dns responses (which are likely to be
already waiting in a TCP/IP receive queue) can be collected.

Btw, the %f formatting in a debug printout is in preparation for possible
future adoption of Time::HiRes more widely (currently some modules use
it if available, otherwise they fall back to a builtin time() ).
Having more precise time would be nice here as well, when dealing
with near- one second intervals.

Back to the problem the patch is fixing: Such a situation occurs usually
when Razor or Dcc servers are unresponsive, or when rules take long time
to compute (longer than rbl_tmeout), or when Botnet gets stuck on checking
a rDNS on some IP with broken DNS servers. Previously this caused all
RBL and URIBL responses to be ignored.

Btw, initially I had something more sophisticated in mind (like using
a select to see if a receive queue is empty), but this could potentially
open up a DoS hole by arbitrarily extending processing time by throwing
bogus DNS responses to SA, so some additional time limit would still be
needed - which brings us back to the simple solution.

Comment 9 Mark Martinec 2007-06-14 17:48:36 UTC

Created attachment 3991 [details]
The additional patch, filling up my 'feature request'

Comment 10 Mark Martinec 2007-06-14 17:52:13 UTC

Btw, I have provided similar code change for harvest_until_rule_completes as
for harvest_dnsbl_queries. Actually only the second one harvest_dnsbl_queries
is problematic, I don't know if the change is needed for the first one too.
I guess it can't hurt to keep both routines mostly in sync.

Comment 11 Justin Mason 2007-06-15 05:32:24 UTC

I think that looks good, too: +1

Comment 12 Sidney Markowitz 2007-06-15 17:56:51 UTC

+1 on both patches

Comment 13 Doc Schneider 2007-06-15 18:09:12 UTC

+1 on both

Comment 14 Justin Mason 2007-06-18 09:27:36 UTC

trunk:

: jm 176...; svn commit -m "bug 5511: fix network lookup timeouts, where lookups
were being lost once a timeout was hit; also fix code to match documentation on
rbl_timeout's scaling and minimum duration of 1 second; and attempt to collect
already-received DNS responses when the timeout is reached. thanks to Mark
Martinec" lib/Mail/SpamAssassin/Dns.pm
Sending        lib/Mail/SpamAssassin/Dns.pm
Transmitting file data .
Committed revision 548397.

3.2.x:

: jm 158...; svn commit -m "bug 5511: fix network lookup timeouts, where lookups
were being lost once a timeout was hit; also fix code to match documentation on
rbl_timeout's scaling and minimum duration of 1 second; and attempt to collect
already-received DNS responses when the timeout is reached. thanks to Mark
Martinec" lib/Mail/SpamAssassin/Dns.pm
Sending        lib/Mail/SpamAssassin/Dns.pm
Transmitting file data .
Committed revision 548396.

Comment 15 Andrew Vasilyev 2007-08-03 08:30:32 UTC

This change of network lookups timeouts dramatically increases the time of 
scan.

Sequential scanning of 1494 messages on the same machine (FreeBSD 6.2, Perl 
5.8.8), with the same config (rbl_timeout=1, use_bayes=0, use_autowhitelist=1,
skip_rbl_checks={0,1}) gives:

3.2.1: 333 seconds with skip_rbl_checks=0; 146 seconds skip_rbl_checks=1
3.2.2: 643 seconds with skip_rbl_checks=0; 168 seconds skip_rbl_checks=1

Comment 16 Mark Martinec 2007-08-08 09:58:14 UTC

> This change of network lookups timeouts dramatically increases the time
> of scan.
> 3.2.1: 333 seconds with skip_rbl_checks=0; 146 seconds skip_rbl_checks=1
> 3.2.2: 643 seconds with skip_rbl_checks=0; 168 seconds skip_rbl_checks=1 

Well, you can either have results quick but miss many, or let DNS responses
be harvested in reasonable time, which will increase the total elapsed
time somewhat, but produce more accurate results.

Btw, latency is usually not a problem, as it can be easily compensated
by increasing the number of parallel processes, yielding the same
maximal throughput the machine can handle.

To remind ourselves what is the core problem here: some asynchronous
DNS lookup requests carry a completion routine which then spawns
additional DNS queries. This happens currently in Plugin::URIDNSBL,
but could in principle happen in any completion routine. For example,
a lookup on a NS record spawns an additional lookup for an A record,
but not before harvest_dnsbl_queries happens, which is quite late,
so not much time remains for collecting responses to these new queries.
Same goes for URIDNS lookups such as sbl.spamhaus.org, cn-kr.blackholes.us
etc.

What happened before the patch is that most of these additional
lookup results were simply discarded if other rules took long time
to complete (longer than rbl_timeout), which was a real waste
of good information.

A solution is in processing completion routines (which may spawn
new queries) much sooner and gradually, as responses trickle in,
and not wait until priority 500, which is almost at the end of
mail processing.

I have a patch (for 3.2.2) which accomplishes this. It is rather
substantial (but compatible with the rest of the code), touching
AsyncLoop.pm mostly, but also Dns.pm, DnsResolver.pm, Plugin/Check.pm
and Plugin/URIDNSBL.pm. It is certainly not something that could go
into 3.2 series. I'll let it simmer here on our production machine
for a couple of days, just in case, then make it available.

Btw, is trunk in C-T-R mode now? Do I open a new bug entry, post
a patch, then commit to trunk in few days without a vote (barring
objections of course)?

Comment 17 Justin Mason 2007-08-08 10:10:01 UTC

(In reply to comment #16)
> Btw, is trunk in C-T-R mode now? Do I open a new bug entry, post
> a patch, then commit to trunk in few days without a vote (barring
> objections of course)?

yep, trunk is in C-T-R mode.  for really large or controversial changes,
you can open a bug and post a patch (and/or create a branch); otherwise,
if it's a minor change that won't be controversial, feel free to commit it.

in this kind of situation, opening a bug serves mainly to create a venue
for discussion.

This change sounds pretty big, but probably quite uncontroversial (assuming
it doesn't change any of the public POD-documented APIs), so you may not
need to open a bug... see what you think applies.

Comment 18 Mark Martinec 2007-08-10 08:31:53 UTC

> This change sounds pretty big, but probably quite uncontroversial (assuming
> it doesn't change any of the public POD-documented APIs), so you may not
> need to open a bug... see what you think applies.

Followups to Bug 5589.