|
SA Bugzilla – Full Text Bug Listing |
Summary: | [review] Dns.pm: timeout zero while collecting DNS responses | ||
---|---|---|---|
Product: | Spamassassin | Reporter: | Mark Martinec <Mark.Martinec> |
Component: | Libraries | Assignee: | SpamAssassin Developer Mailing List <dev> |
Status: | RESOLVED FIXED | ||
Severity: | normal | ||
Priority: | P3 | ||
Version: | 3.2.0 | ||
Target Milestone: | 3.2.2 | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | 3985 can be committed, 3991 can be comitted | ||
Bug Depends on: | |||
Bug Blocks: | 5585 | ||
Attachments: |
the promised patch
The additional patch, filling up my 'feature request' |
Description
Mark Martinec
2007-06-12 09:13:41 UTC
Created attachment 3985 [details]
the promised patch
thanks Mark, will look at these Thoughts for the future: From a broader perspective, it would probably be desirable to launch the second round of DNS queries (by Plugin::URIDNSBL) shortly after plain RBL lookups, i.e. before Razor, dcc, Botnet evaluation starts, to give them more time to collect results (or to avoid stretching time limit further). Other plugins could benefit from being split into two parts, e.g. Plugin::DKIM could launch a DNS query for a signer's public key and policy after checking a mail header, but leave signature verification for later, to give DNS queries time to collect results. Copying my reply on SA user list, topic 'These are getting through SA...'. Bill Landry, > Mark, I patched Dns.pm but this didn't resolve the issue for me. > You can test with the sample messages I posted to bugzilla: Bug 5506 Yes, it is the same problem as I describe in Bug 5511 but to fix it requires my 'feature request' to be implemented: Now for the last part - a feature request. I think that no attempt has been made to collect already received DNS responses when timeout is reached. Given an asynchronous nature of DNS lookups in this module, I think it would be worthwhile to collect whatever is still in the IP receive queue after a timeout. The problem with BOTNET is that it tries to do a reverse DNS lookup on 66.17.235.109, which has broken DNS servers and none are reachable, so it hangs in sub get_rdns query($ip,'PTR','IN') for 24 seconds. After BOTNET finally times out, the Dns.pm harvest_dnsbl_queries tries to collect its RBL results, but abandons all attempts right away because it sees that 24 seconds has passed by, and just declares them timed out, despite the fact that DNS results are waiting in TCP/IP received queue. ok, that patch makes sense. applied to trunk: : jm 176...; svn commit -m "bug 5511: recompute the DNS lookup deadline after every call to {async}->complete_lookups(), to avoid timing out too early; also, never allow the deadline to be recomputed to 0 in contradiction of the documentation. thanks to Mark Martinec" lib/Mail/SpamAssassin/Dns.pm Sending lib/Mail/SpamAssassin/Dns.pm Transmitting file data . Committed revision 547344. +1 to apply to 3.2.x... I agree, the feature request makes sense, too. I'd apply that if someone coded it up ;) > ok, that patch makes sense. applied to trunk Thanks! > I agree, the feature request makes sense, too. > I'd apply that if someone coded it up ;) I came up with something simple for starters, will check it out first and report. just to keep it in mind -- if these major kind of bugs also affect 3.1, we should backport the fix there as well. > I came up with something simple for starters, will check it out first
> and report.
Ok, here it comes, the attached patch applies on top of my previous one.
This one is quite simple: if harvest_dnsbl_queries() happens to be called
very late - when rbl_timeout has already expired - then extend the deadline
for an additional second so that dns responses (which are likely to be
already waiting in a TCP/IP receive queue) can be collected.
Btw, the %f formatting in a debug printout is in preparation for possible
future adoption of Time::HiRes more widely (currently some modules use
it if available, otherwise they fall back to a builtin time() ).
Having more precise time would be nice here as well, when dealing
with near- one second intervals.
Back to the problem the patch is fixing: Such a situation occurs usually
when Razor or Dcc servers are unresponsive, or when rules take long time
to compute (longer than rbl_tmeout), or when Botnet gets stuck on checking
a rDNS on some IP with broken DNS servers. Previously this caused all
RBL and URIBL responses to be ignored.
Btw, initially I had something more sophisticated in mind (like using
a select to see if a receive queue is empty), but this could potentially
open up a DoS hole by arbitrarily extending processing time by throwing
bogus DNS responses to SA, so some additional time limit would still be
needed - which brings us back to the simple solution.
Created attachment 3991 [details]
The additional patch, filling up my 'feature request'
Btw, I have provided similar code change for harvest_until_rule_completes as for harvest_dnsbl_queries. Actually only the second one harvest_dnsbl_queries is problematic, I don't know if the change is needed for the first one too. I guess it can't hurt to keep both routines mostly in sync. I think that looks good, too: +1 +1 on both patches +1 on both trunk: : jm 176...; svn commit -m "bug 5511: fix network lookup timeouts, where lookups were being lost once a timeout was hit; also fix code to match documentation on rbl_timeout's scaling and minimum duration of 1 second; and attempt to collect already-received DNS responses when the timeout is reached. thanks to Mark Martinec" lib/Mail/SpamAssassin/Dns.pm Sending lib/Mail/SpamAssassin/Dns.pm Transmitting file data . Committed revision 548397. 3.2.x: : jm 158...; svn commit -m "bug 5511: fix network lookup timeouts, where lookups were being lost once a timeout was hit; also fix code to match documentation on rbl_timeout's scaling and minimum duration of 1 second; and attempt to collect already-received DNS responses when the timeout is reached. thanks to Mark Martinec" lib/Mail/SpamAssassin/Dns.pm Sending lib/Mail/SpamAssassin/Dns.pm Transmitting file data . Committed revision 548396. This change of network lookups timeouts dramatically increases the time of scan. Sequential scanning of 1494 messages on the same machine (FreeBSD 6.2, Perl 5.8.8), with the same config (rbl_timeout=1, use_bayes=0, use_autowhitelist=1, skip_rbl_checks={0,1}) gives: 3.2.1: 333 seconds with skip_rbl_checks=0; 146 seconds skip_rbl_checks=1 3.2.2: 643 seconds with skip_rbl_checks=0; 168 seconds skip_rbl_checks=1 > This change of network lookups timeouts dramatically increases the time
> of scan.
> 3.2.1: 333 seconds with skip_rbl_checks=0; 146 seconds skip_rbl_checks=1
> 3.2.2: 643 seconds with skip_rbl_checks=0; 168 seconds skip_rbl_checks=1
Well, you can either have results quick but miss many, or let DNS responses
be harvested in reasonable time, which will increase the total elapsed
time somewhat, but produce more accurate results.
Btw, latency is usually not a problem, as it can be easily compensated
by increasing the number of parallel processes, yielding the same
maximal throughput the machine can handle.
To remind ourselves what is the core problem here: some asynchronous
DNS lookup requests carry a completion routine which then spawns
additional DNS queries. This happens currently in Plugin::URIDNSBL,
but could in principle happen in any completion routine. For example,
a lookup on a NS record spawns an additional lookup for an A record,
but not before harvest_dnsbl_queries happens, which is quite late,
so not much time remains for collecting responses to these new queries.
Same goes for URIDNS lookups such as sbl.spamhaus.org, cn-kr.blackholes.us
etc.
What happened before the patch is that most of these additional
lookup results were simply discarded if other rules took long time
to complete (longer than rbl_timeout), which was a real waste
of good information.
A solution is in processing completion routines (which may spawn
new queries) much sooner and gradually, as responses trickle in,
and not wait until priority 500, which is almost at the end of
mail processing.
I have a patch (for 3.2.2) which accomplishes this. It is rather
substantial (but compatible with the rest of the code), touching
AsyncLoop.pm mostly, but also Dns.pm, DnsResolver.pm, Plugin/Check.pm
and Plugin/URIDNSBL.pm. It is certainly not something that could go
into 3.2 series. I'll let it simmer here on our production machine
for a couple of days, just in case, then make it available.
Btw, is trunk in C-T-R mode now? Do I open a new bug entry, post
a patch, then commit to trunk in few days without a vote (barring
objections of course)?
(In reply to comment #16) > Btw, is trunk in C-T-R mode now? Do I open a new bug entry, post > a patch, then commit to trunk in few days without a vote (barring > objections of course)? yep, trunk is in C-T-R mode. for really large or controversial changes, you can open a bug and post a patch (and/or create a branch); otherwise, if it's a minor change that won't be controversial, feel free to commit it. in this kind of situation, opening a bug serves mainly to create a venue for discussion. This change sounds pretty big, but probably quite uncontroversial (assuming it doesn't change any of the public POD-documented APIs), so you may not need to open a bug... see what you think applies. > This change sounds pretty big, but probably quite uncontroversial (assuming > it doesn't change any of the public POD-documented APIs), so you may not > need to open a bug... see what you think applies. Followups to Bug 5589. |