Bug 6075 - dnsbl checks time out
Summary: dnsbl checks time out
Status: RESOLVED INVALID
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin (show other bugs)
Version: 3.2.5
Hardware: PC Linux
: P5 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-02-27 05:19 UTC by Elsa Andrés
Modified: 2009-03-03 22:24 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status
tcpdump application/x-zip-compressed None Elsa Andrés [NoCLA]
SA test with named application/x-zip-compressed None Elsa Andrés [NoCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Elsa Andrés 2009-02-27 05:19:11 UTC
I am unable to perform dnsbl checks. 

Running "spamassassin -D </home/serv1/test.txt 2>&1 | grep dns | less" I get: 

*** 
[7336] dbg: dns: no ipv6 
[7336] dbg: dns: is Net::DNS::Resolver available? yes 
[7336] dbg: dns: Net::DNS version: 0.60 

[7336] dbg: dns: name server: 80.58.0.33, LocalAddr: 0.0.0.0 
[7336] dbg: dns: resolver socket rx buffer size is 124928 bytes 
[7336] dbg: dns: testing resolver nameservers: 80.58.0.33, 80.58.32.97, 208.67.222.222 
[7336] dbg: dns: trying (3) kernel.org... 
[7336] dbg: dns: looking up NS for 'kernel.org' 
[7336] dbg: dns: NS lookup of kernel.org using 80.58.0.33 succeeded => DNS available (set dns_available to override) 

[7305] dbg: dns: harvest_dnsbl_queries - check_tick 
[7305] dbg: dns: harvest_dnsbl_queries - check_tick 
[7305] dbg: dns: harvest_dnsbl_queries - check_tick 
[7305] dbg: dns: harvest_dnsbl_queries - check_tick 
[7305] dbg: dns: harvest_dnsbl_queries - check_tick 
[7305] dbg: dns: harvest_dnsbl_queries - check_tick 
[7305] dbg: dns: harvest_dnsbl_queries - check_tick 
[7305] dbg: dns: harvest_dnsbl_queries - check_tick 

[7305] dbg: async: aborting after 14.974 s, deadline shrunk: DNSBL-TXT, dns:TXT:15.35.17.212.sa-trusted.bondedsender.org. 
[7305] dbg: async: aborting after 14.973 s, deadline shrunk: DNSBL-A, dns:A:142.249.10.63.zen.spamhaus.org. 
[7305] dbg: async: aborting after 14.981 s, deadline shrunk: DNSBL-A, dns:A:142.249.10.63.dnsbl.sorbs.net. 
[7305] dbg: async: aborting after 14.989 s, deadline shrunk: DNSBL-TXT, dns:TXT:52.130.184.208.bl.spamcop.net. 
[7305] dbg: async: aborting after 14.994 s, deadline shrunk: DNSBL-A, dns:A:254.105.28.200.combined.njabl.org. 
[7305] dbg: async: aborting after 14.996 s, deadline shrunk: DNSBL-A, dns:A:15.35.17.212.plus.bondedsender.org. 
[7305] dbg: async: aborting after 14.978 s, deadline shrunk: DNSBL-A, dns:A:15.35.17.212.sa-accredit.habeas.com. 
[7305] dbg: async: aborting after 14.992 s, deadline shrunk: DNSBL-A, dns:A:213.21.195.128.combined.njabl.org. 
[7305] dbg: async: aborting after 14.989 s, deadline shrunk: DNSBL-TXT, dns:TXT:213.21.195.128.bl.spamcop.net. 
[7305] dbg: async: aborting after 14.970 s, deadline shrunk: DNSBL-A, dns:A:example.com.bl.open-whois.org. 
[7305] dbg: async: aborting after 14.993 s, deadline shrunk: DNSBL-A, dns:A:52.130.184.208.combined.njabl.org. 
[7305] dbg: async: aborting after 14.979 s, deadline shrunk: DNSBL-A, dns:A:213.21.195.128.dnsbl.sorbs.net. 

[7305] dbg: async: timing: 0.184 . dns:TXT:254.105.28.200.sa-other.bondedsender.org. 
[7305] dbg: async: timing: 14.968 X dns:A:15.35.17.212.iadb.isipp.com. 
[7305] dbg: async: timing: 14.969 X dns:A:example.com.fulldom.rfc-ignorant.org. 
[7305] dbg: async: timing: 14.970 X dns:A:example.com.bl.open-whois.org. 
[7305] dbg: async: timing: 14.971 X dns:A:213.21.195.128.zen.spamhaus.org. 
[7305] dbg: async: timing: 14.972 X dns:A:52.130.184.208.zen.spamhaus.org. 
[7305] dbg: async: timing: 14.972 X dns:A:254.105.28.200.zen.spamhaus.org. 
[7305] dbg: async: timing: 14.973 X dns:A:142.249.10.63.zen.spamhaus.org. 
[7305] dbg: async: timing: 14.974 X dns:TXT:15.35.17.212.sa-trusted.bondedsender.org. 
[7305] dbg: async: timing: 14.975 X dns:TXT:15.35.17.212.list.dsbl.org. 
*** 

SA installed version is "3.2.5". 

P.S. FYI, I posted this question on SA lists:

http://www.nabble.com/dnsbl-checks-time-out-td22222481.html
Comment 1 Karsten Bräckelmann 2009-02-27 08:02:52 UTC
(In reply to comment #0)
> I am unable to perform dnsbl checks. 

One test-point you mentioned on-list is, that you can do DNS lookups, like

  host spamhaus.org

Very well. Did you ever try reproducing the real lookups done by SA? Try something like this (taken from the debug output):

  host 142.249.10.63.zen.spamhaus.org


> P.S. FYI, I posted this question on SA lists:

Just for the records: Bugzilla is not an appropriate place for questions...

The sad truth is, that despite a lot of helpful posts and debugging on list, the mysterious issue at hand persists. As has been mentioned on-list, there isn't much to report here -- other than it doesn't work, whereas it even works on other SA machines Elsa uses. So I suggest to keep looking for differences between machines...

Unless we figure out a way to reproduce this, it remains a mystery limited to that  particular machine. Which eventually translates to WORKSFORME... *sigh*
Comment 2 Elsa Andrés 2009-02-27 09:51:05 UTC
(In reply to comment #1)

> One test-point you mentioned on-list is, that you can do DNS lookups, like
>   host spamhaus.org

> Very well. Did you ever try reproducing the real lookups done by SA? Try
> something like this (taken from the debug output):
>   host 142.249.10.63.zen.spamhaus.org

Running that I get no hit, but *neither* time out.

host 142.249.10.63.zen.spamhaus.org
Host 142.249.10.63.zen.spamhaus.org not found: 3(NXDOMAIN)

> Just for the records: Bugzilla is not an appropriate place for questions...

I know. I am not looking for "support".

I wrote this report based on previous SA bugs like these ones, that lead me to consider it could be a "real" problem here:

Refined async events handling and DNS lookup completions
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5589

Dns.pm: timeout zero while collecting DNS responses
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5511

As I am not a SA developer and have not clue nor documentation I can check to review the error messages I am getting. I think openning a bug report should be the right way.

> The sad truth is, that despite a lot of helpful posts and debugging on list,
> the mysterious issue at hand persists. As has been mentioned on-list, there
> isn't much to report here -- other than it doesn't work, whereas it even works
> on other SA machines Elsa uses. So I suggest to keep looking for differences
> between machines...
> Unless we figure out a way to reproduce this, it remains a mystery limited to
> that  particular machine. Which eventually translates to WORKSFORME... *sigh*

If there is no bug here, I will understand that decission.
Comment 3 Mark Martinec 2009-02-27 11:51:10 UTC
Perhaps it's time to see what is happening on the wire:

tcpdump -i your-aggregate-interface -s 0 -w 0.log \
  '(udp port 53 or icmp) and (host 80.58.0.33 or 80.58.32.97)'

while doing a test, then attach the binary log (or pastebin it somewhere).


> Net::DNS version: 0.60

That's rather old, the changelog to 0.65 is quite long.
Btw, which version of Perl is it?
Comment 4 Elsa Andrés 2009-02-27 13:00:21 UTC
(In reply to comment #3)

> Perhaps it's time to see what is happening on the wire:
> tcpdump -i your-aggregate-interface -s 0 -w 0.log \
>   '(udp port 53 or icmp) and (host 80.58.0.33 or 80.58.32.97)'
> while doing a test, then attach the binary log (or pastebin it somewhere).

I am attaching a zipped file with results of command "tcpdump -i bond0 -s 0 -w 0.txt port 53" and then retrieved with "tcpdump -r 0.txt"

> > Net::DNS version: 0.60
> That's rather old, the changelog to 0.65 is quite long.

Well, it's the same version I've got on another machine.

> Btw, which version of Perl is it?

perl -v
This is perl, v5.8.8 built for x86_64-linux-thread-multi

HTH.
Comment 5 Elsa Andrés 2009-02-27 13:01:28 UTC
Created attachment 4436 [details]
tcpdump
Comment 6 Mark Martinec 2009-02-27 14:33:50 UTC
I had a raw (binary) packet capture in mind. The textual summary
can cover interesting details, which might be important.

But it will do for now. It seems the queries are sent, but replies
are not coming back (assuming you left the capture running
sufficiently long time after running spamassassin, say a minute).

Looks like you need to do some network troubleshooting, like
looking at a DNS server's log file, trying the same capture
on the DNS host, checking firewall, network, ...
Comment 7 Elsa Andrés 2009-02-28 02:44:06 UTC
(In reply to comment #6)

> I had a raw (binary) packet capture in mind. The textual summary
> can cover interesting details, which might be important.
> But it will do for now. It seems the queries are sent, but replies
> are not coming back (assuming you left the capture running
> sufficiently long time after running spamassassin, say a minute).

Yes, no reply.

Strange thing is that making single queries, I get a reply from dns server:

***
host 213.21.195.128.zen.spamhaus.org
Host 213.21.195.128.zen.spamhaus.org not found: 3(NXDOMAIN)
***

and tcpdump...

***
11:27:00.621787 IP 10.0.0.11.32781 > 33.Red-80-58-0.staticIP.rima-tde.net.domain: 3709+ A? 213.21.195.128.zen.spamhaus.org. (49)
11:27:00.798656 IP 33.Red-80-58-0.staticIP.rima-tde.net.domain > 10.0.0.11.32781: 3709 NXDomain 0/1/0 (113)
11:27:00.798916 IP 10.0.0.11.32781 > 33.Red-80-58-0.staticIP.rima-tde.net.domain: 8369+ A? 213.21.195.128.zen.spamhaus.org. (49)
11:27:00.838842 IP 33.Red-80-58-0.staticIP.rima-tde.net.domain > 10.0.0.11.32781: 8369 NXDomain 0/1/0 (113)
11:27:05.621448 IP 10.0.0.11.32781 > 33.Red-80-58-0.staticIP.rima-tde.net.domain: 17060+ PTR? 1.0.0.10.in-addr.arpa. (39)
11:27:05.661448 IP 33.Red-80-58-0.staticIP.rima-tde.net.domain > 10.0.0.11.32781: 17060 NXDomain 0/1/0 (116)
11:29:06.996959 IP 10.0.0.11.32781 > 33.Red-80-58-0.staticIP.rima-tde.net.domain: 40187+ PTR? 1.0.0.10.in-addr.arpa. (39)
11:29:07.036717 IP 33.Red-80-58-0.staticIP.rima-tde.net.domain > 10.0.0.11.32781: 40187 NXDomain 0/1/0 (116)
***

So maybe there is something in the way SA makes dnsbl queries what it is causing this.

Is there any way I could tell SA to query just "one" dnsbl list? This way I could reproduce a single "host" query, which outside SA, works o.k.

> Looks like you need to do some network troubleshooting, like
> looking at a DNS server's log file, trying the same capture
> on the DNS host, checking firewall, network, ...

Well, I am trying all the tests I can perfom at my side (disabling firewalls -computer and router ones- disabling any filters that could be blocking packets and so on).

Once the packets leave my router I don't have any control. I am using my ISP dns servers.
Comment 8 Sidney Markowitz 2009-02-28 19:12:43 UTC
(in reply to comment #7)

> I am using my ISP dns servers.

One difference between what spamassassin does and running the host command for a single query is that spamassassin uses only the first DNS server on the list that responds to a test at startup time, while host will try servers in succession if the first one fails. If the first server on the list is flaky, it may respond to one of the test queries at initialization and then not later. Try configuring your machine with only one DNS server and try out each of your ISP's DNS servers one at at time to see if it makes any difference.

It probably won't make any difference, but just in case, try using the OpenDNS servers and see if it makes a difference. See http://wiki.apache.org/spamassassin/OpenDnsAndUribls first about turning off the OpenDNS typo correction feature or else RBLs other than SURBL and URIBL won't work.
Comment 9 Elsa Andrés 2009-03-01 04:14:57 UTC
(In reply to comment #8)

> One difference between what spamassassin does and running the host command for
> a single query is that spamassassin uses only the first DNS server on the list
> that responds to a test at startup time, while host will try servers in
> succession if the first one fails. If the first server on the list is flaky, it
> may respond to one of the test queries at initialization and then not later.

Yes, I knew that.

> Try configuring your machine with only one DNS server and try out each of your
> ISP's DNS servers one at at time to see if it makes any difference.
> It probably won't make any difference, but just in case, try using the OpenDNS
> servers and see if it makes a difference. See
> http://wiki.apache.org/spamassassin/OpenDnsAndUribls first about turning off
> the OpenDNS typo correction feature or else RBLs other than SURBL and URIBL
> won't work.

Well, I already made that test (that is, using opendns servers as well as the secondary one provided by my isp) and the problem remains. 

OTOH, I have no account with opendns, just use their dns servers, so I think have no way to disable anything.

I wonder if there is a way to test the resolver functions which SA uses for quering dnsbl servers. As per bug #5589 [Refined async events handling and DNS lookup completions], maybe something wrong happens in my machine due to my network configuration (bonded interfaces) when using ".pm" modules listed under /usr/lib/perl5/vendor_perl/5.8.8/Mail/SpamAssassin/

AsyncLoop.pm
Dns.pm
DnsResolver.pm

Could I test this ones outside SA to see if they are causing the "timeout" reponse I am getting? That should gives us a clue on where to start.
Comment 10 Sidney Markowitz 2009-03-01 19:42:53 UTC
(In reply to comment #9)
> OTOH, I have no account with opendns, just use their dns servers, so I think
> have no way to disable anything.

It's easy enough to create one, but that shouldn't make a difference anyway as you are getting no response, not the incorrect response that the typo correction would cause.

> Could I test this ones outside SA to see if they are causing the "timeout"
> reponse I am getting? That should gives us a clue on where to start.

It's been a while since I debugged that code, but I recall that when I had to see just what the resolver was doing and -D dns did not show enough detail I edited the call to new of the dns reolver object to enable its debug flag and used that to follow exactly what queries and replies were happening. Tedious to follow all that, but it helped when I needed it.
Comment 11 Mark Martinec 2009-03-02 04:24:39 UTC
> I am attaching a zipped file with results of command "tcpdump -i bond0 -s 0 -w
> 0.txt port 53" and then retrieved with "tcpdump -r 0.txt"

You left out the 'or icmp' part. A comfortable way to view such capture in detail is through wireshark.

> Is there any way I could tell SA to query just "one" dnsbl list? This way
> I could reproduce a single "host" query, which outside SA, works o.k.

This comes close (but there may still be a difference):

perl -le 'use Net::DNS; $r=Net::DNS::Resolver->new(nameservers=>
["80.58.0.33"]); $q=$r->query("213.21.195.128.dnsbl.sorbs.net","A");
if ($q) { ($q->answer)[0]->print } else { print $r->errorstring }'

or set scores for unwanted lookups to 0 (probably too much work
to find them all).

>> Looks like you need to do some network troubleshooting, like
>> looking at a DNS server's log file, trying the same capture
>> on the DNS host, checking firewall, network, ...
>
> Well, I am trying all the tests I can perfom at my side (disabling firewalls
> -computer and router ones- disabling any filters that could be blocking
> packets and so on).
> Once the packets leave my router I don't have any control.
> I am using my ISP dns servers.

With multi giga links and several hosts in the network, along with inability
to have any control over provider's DNS server (their UDP rate limits,
firewall, NAT, looking into their DNS log, doing a packet capture at a DNS
host), I really believe you should set up your own DNS resolving server,
at least temporarily to troubleshoot this issue. I see no good reason to
depend on third party DNS resolver and its potential quirks with anything
beyond a trivial home network. Installing a 'bind' (or other server) in
caching-only mode should be out-of-the-box. Just don't set any forwarders,
otherwise you are likely back to the same issue.
Comment 12 Elsa Andrés 2009-03-02 04:40:09 UTC
(In reply to comment #10)

> It's been a while since I debugged that code, but I recall that when I had to
> see just what the resolver was doing and -D dns did not show enough detail I
> edited the call to new of the dns reolver object to enable its debug flag and
> used that to follow exactly what queries and replies were happening. Tedious to
> follow all that, but it helped when I needed it.

Is there any way I can make the debugger more verbose? Any flag I can activate
in ".pm" modules or anything I can test on my side...?

I just tried one of the examples listed ond Net::DNS and, AFAICT, it seems
working:

test.pl

***
#!/usr/bin/perl

use Net::DNS;
         my $res   = Net::DNS::Resolver->new;
         my $query = $res->search("52.130.184.208.zen.spamhaus.org");

         if ($query) {
             foreach my $rr ($query->answer) {
                 next unless $rr->type eq "A";
                 print $rr->address, "\n";
             }
         } else {
             warn "query failed: ", $res->errorstring, "\n";
         }
***

When I run it, it gives me the following response:

./test.pl
216.234.246.150

And tcpdump get it also:

***
13:28:55.686573 IP 10.0.0.11.32853 >
33.Red-80-58-0.staticIP.rima-tde.net.domain: 17777+ A?
52.130.184.208.zen.spamhaus.org. (49)
13:28:55.726577 IP 33.Red-80-58-0.staticIP.rima-tde.net.domain >
10.0.0.11.32853: 17777 NXDomain 0/1/0 (113)
13:28:55.729749 IP 10.0.0.11.32853 >
33.Red-80-58-0.staticIP.rima-tde.net.domain: 28412+ A?
52.130.184.208.zen.spamhaus.org.net. (53)
13:28:55.770007 IP 33.Red-80-58-0.staticIP.rima-tde.net.domain >
10.0.0.11.32853: 28412 1/0/0 A 96.f6.ead8.static.theplanet.com (69)
***

So I cannot guess why anyting seems to be resolving o.k. but SA :-?
Comment 13 Karsten Bräckelmann 2009-03-02 05:15:51 UTC
(In reply to comment #12)
> my $query = $res->search("52.130.184.208.zen.spamhaus.org");

> When I run it, it gives me the following response:
> 
> ./test.pl
> 216.234.246.150

Huh?  That doesn't look like a Spamhaus response...

Host 52.130.184.208.zen.spamhaus.org not found: 3(NXDOMAIN)

150.246.234.216.in-addr.arpa domain name pointer
  96.f6.ead8.static.theplanet.com.
Comment 14 Matt Kettler 2009-03-02 05:41:23 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > my $query = $res->search("52.130.184.208.zen.spamhaus.org");
> 
> > When I run it, it gives me the following response:
> > 
> > ./test.pl
> > 216.234.246.150
> 
> Huh?  That doesn't look like a Spamhaus response...
> 

At casual glance it looks like your ISP or DSN provider is running a DNS over-ride that returns a search website every time you enter an invalid domain. 

Verizon does this by default, it's evil. The "search site" returned is essentially a giant for-pay ad.

More and more "end user" ISPs are doing this in order to try to get a little extra revenue, but it surprises me that a business class carrier would do it.




Comment 15 Karsten Bräckelmann 2009-03-02 06:48:13 UTC
Even more strange, the IP returned as per comment 12 looks like an actual end-user IP. The web-server currently claims "It works!". ;)

If the test script really returned that IP for Elsa, clearly something seriously is broken. And it's not SA. :)
Comment 16 Elsa Andrés 2009-03-02 08:01:21 UTC
(In reply to comment #15)

> Even more strange, the IP returned as per comment 12 looks like an actual
> end-user IP. The web-server currently claims "It works!". ;)
> If the test script really returned that IP for Elsa, clearly something
> seriously is broken. And it's not SA. :)

Well, not that fast, sir :-)

Sure there is something weird but this has nothing to do with a "timeout" from SA. If I get a response from a server (whatever response) I cannot get a timeout for this query, right? That is the key. 

Anyway, a query on a blacklisted IP is returning a positive (standard) response:

***
host 50.33.213.190.dnsbl.sorbs.net
50.33.213.190.dnsbl.sorbs.net has address 127.0.0.7
***

So, nothing strange here.
Comment 17 Mark Martinec 2009-03-02 08:32:20 UTC
> > > When I run it, it gives me the following response:
> > > ./test.pl
> > > 216.234.246.150
> > 
> > Huh?  That doesn't look like a Spamhaus response...
> 
> At casual glance it looks like your ISP or DSN provider is running a DNS
> over-ride that returns a search website every time you enter an invalid domain. 
> Verizon does this by default, it's evil. The "search site" returned is
> essentially a giant for-pay ad.
> More and more "end user" ISPs are doing this in order to try to get a little
> extra revenue, but it surprises me that a business class carrier would do it.

The 216.234.246.150 result here is easy to explain, the test program
failed to turn off the search list:  $res->dnsrch(0)  before the query,
so Net::DNS sent out two queries, for:
52.130.184.208.zen.spamhaus.org,  and for:
52.130.184.208.zen.spamhaus.org.net

The second query returned the 216.234.246.150.

This is not happening when Net::DNS is called by SpamAssassin,
the test was a red herring.

Comment 18 Karsten Bräckelmann 2009-03-02 09:01:52 UTC
(In reply to comment #16)
> host 50.33.213.190.dnsbl.sorbs.net
> 50.33.213.190.dnsbl.sorbs.net has address 127.0.0.7

Yeah, I meant to add "not a timeout, though" to my previous comment, but forgot when writing. My bad.  However, we already know 'host' works -- you where trying to debug Perl, imitating SA behavior, right?


(In reply to comment #17)
> The second query returned the 216.234.246.150.

Oh, how annoying.
Comment 19 Elsa Andrés 2009-03-02 10:06:07 UTC
(In reply to comment #18)

> Yeah, I meant to add "not a timeout, though" to my previous comment, but forgot
> when writing. My bad.  However, we already know 'host' works -- you where
> trying to debug Perl, imitating SA behavior, right?

Yes. Sorry if I was a bit rude :-)

Well, in the way of debugging Perl (as comment #11 suggested) I have found a script that can help a bit, JFYR. The script is as follows:

***
#!/usr/bin/perl

use Net::DNS::Resolver;

my $res = Net::DNS::Resolver->new(
               nameservers => [qw(80.58.0.33)], # specify your own here
               udp_timeout => 2,
               retry       => 1,
               debug       => 1,
              );

my $host = '213.21.195.128.dnsbl.sorbs.net';

if (my $pkt = $res->query($host)) {
         for my $answer ( $pkt->answer() ) {
             my $type = $answer->type();
             if ($type eq "PTR") {
                 print $answer->ptrdname(), "\n";
             } elsif ($type eq "A") {
                 print $answer->name(), "\n";
             }
         }
}
***

And what I get after running it:

./test.pl

***
;; query(213.21.195.128.dnsbl.sorbs.net)
;; setting up an AF_INET() family type UDP socket
;; send_udp(80.58.0.33:53)
;; answer from 80.58.0.33:53 : 104 bytes
;; HEADER SECTION
;; id = 61344
;; qr = 1    opcode = QUERY    aa = 0    tc = 0    rd = 1
;; ra = 1    ad = 0    cd = 0    rcode  = NXDOMAIN
;; qdcount = 1  ancount = 0  nscount = 1  arcount = 0

;; QUESTION SECTION (1 record)
;; 213.21.195.128.dnsbl.sorbs.net.      IN      A

;; ANSWER SECTION (0 records)

;; AUTHORITY SECTION (1 record)
dnsbl.sorbs.net.        600     IN      SOA     rbldns0.sorbs.net. dns.isux.com. (
                                        1236012948      ; Serial
                                        7200    ; Refresh
                                        7200    ; Retry
                                        604800  ; Expire
                                        3600 )  ; Minimum TTL

;; ADDITIONAL SECTION (0 records)
***

And tcpdump:

***
18:56:24.985631 IP 10.0.0.11.32931 > 33.Red-80-58-0.staticIP.rima-tde.net.domain: 29368+ A? 213.21.195.128.dnsbl.sorbs.net. (48)
18:56:25.025395 IP 33.Red-80-58-0.staticIP.rima-tde.net.domain > 10.0.0.11.32931: 29368 NXDomain 0/1/0 (104)
***

Still getting a good response. I cannot reproduce de "timeout" I get with SA, just only through SA.
Comment 20 Mark Martinec 2009-03-02 10:42:56 UTC
> Well, in the way of debugging Perl (as comment #11 suggested) I have
> found a script that can help a bit, JFYR. The script is as follows:

It sounds like a waste of time trying to debug code generating queries,
when you already know (from a packet capture) that queries are
generated and are being sent, and queries do look normal (did you
check the raw capture with wireshark?).

Perhaps it's a burst of packets that is being blocked, maybe it's
NAT/PAT borking up sessions, maybe it's one of the bonded wires
that is eating packets, ...   Try to see what is happening at
the DNS server's end. Set up your own, if you have no access to
your provider's resources. 
Comment 21 Sidney Markowitz 2009-03-02 14:25:35 UTC
(In reply to comment #12)
> Is there any way I can make the debugger more verbose? Any flag I can activate
> in ".pm" modules or anything I can test on my side...?

As I recall, when I was debugging that code there were the dbg statements in SpamAssassin that are activated with -D dns and are specifically no more verbose than would be useful for general troubleshooting with debug logs, the debug option in the NET::DNS::Resolver constructor, and then for more detail after that I had to insert my own debug output statements that I removed once I found the bug. No shortcuts to the usual step-by-step debugging.

After reading all the other comments after comment #12, I would agree that your best bet is to set up your own local caching DNS server. Either the problem will go away, in which case you can attribute it to the complexities of your ISP's name server setup, or else it will continue and you can look closely at what SpamAssassin and your DNS server are doing. The only other approach I see is digging deep into NET::DNS::Resolver to see exactly where the packets go and how they come back.

I would give up on trying to extract a piece of perl code to call Net::DNS::Resolver the same way as SpamAssassin is doing. SpamAssassin has a very different task then most clients built on top of Net::DNS::Resolver because it is sending out many queries that are to be dealt with asynchronously as they come back. The way it uses Net::DNS::Resolver is going to be very different from simple test harnesses that call Net::DNS::Resolver.
Comment 22 Elsa Andrés 2009-03-03 01:46:56 UTC
(In reply to comment #21)

> After reading all the other comments after comment #12, I would agree that your
> best bet is to set up your own local caching DNS server. 

O.K. Let's go "named" :-).

Well, named installed and configured with no forwarders and /etc/resolv.conf left with only one server (10.0.0.11).

I'm attaching the logs for your review and wait for your advice on this (I am not sure on how to interpret it, even no hit, it seems right). So, if you want me to do some other tests, just tell me. If you consider the log is all O.K. and there is nothing wrong there, you can close this bug as resolved/invalid :-)

BTW, although now I am using my own dns server, I keep getting that "fancy" response when making such queries:

***
host 52.130.184.208.combined.njabl.org
52.130.184.208.combined.njabl.org.net has address 216.234.246.150
***

This is no normal, and I have to speak with my isp to take a look into this. Maybe a firewall or any kind of transparent proxy on their side is filtering the results directly into this line/route, not just dns. But this surely has nothing to do with SA.
Comment 23 Elsa Andrés 2009-03-03 01:48:12 UTC
Created attachment 4437 [details]
SA test with named
Comment 24 Mark Martinec 2009-03-03 04:00:04 UTC
> O.K. Let's go "named" :-).
> 
> Well, named installed and configured with no forwarders and /etc/resolv.conf
> left with only one server (10.0.0.11).
> 
> I'm attaching the logs for your review and wait for your advice on this (I am
> not sure on how to interpret it, even no hit, it seems right). So, if you want
> me to do some other tests, just tell me. If you consider the log is all O.K.
> and there is nothing wrong there, you can close this bug as resolved/invalid
> :-)

The log looks perfect, you got DNS responses to all of your 25 DNS queries,
all of them within 0.8 seconds of sending a query.


> BTW, although now I am using my own dns server, I keep getting that "fancy"
> response when making such queries: 
>   host 52.130.184.208.combined.njabl.org
>   52.130.184.208.combined.njabl.org.net has address 216.234.246.150
> This is no normal, and I have to speak with my isp to take a look into this.
> Maybe a firewall or any kind of transparent proxy on their side is filtering
> the results directly into this line/route, not just dns. But this surely has
> nothing to do with SA.

This is a consequence of two things:

- in your /etc/resolv.conf you are telling your resolver to try appending
  one or more fields to the domain name if the attempt with original
  name is unsuccessful. Get rid of your existing 'domain' and 'search'
  directives in that file, and supply just a 'search' with no arguments.
  Or at least do not specify something as vague as 'net' for your domain!
  (according to your tcpdump in #12), but use your exact domain name in a
  'search' option, and do not use a 'domain' option.
  
- the 'org.net' domain you have queried (appended .net to your ...njabl.org)
  is registered by Cyberfusion / OKDIRECT.COM.  Such domains attempt to
  fish for traffic from innocent users which make a typo when enetering an
  URL, or sites like yours, which append '.net' to unsuccessful queries.
  Gratuitous advertising. Don't let them have your mistyped queries!


Btw, I have another hypothesis why your original queries to ISP may be failing.
You have 2 Gbps links from your host - but do you have gigabit connectivity
all the way to your service provider? If your uplink is thin, you should have
some traffic shaping with reasonable buffer size on your device sitting just
before a bottleneck link - usually a firewall or a router. If this is some
dumb device, your queries may have been victims of tail drop in a switch.
Note that SpamAssassin made 29 queries in rapid succession, all within 22
milliseconds. A solution in such scenario would be to turn on traffic
shaper on firewall/router or on your host where SpamAssassin is running.
But never mind, the solution with a caching name server within your
own network is much better.
Comment 25 AXB 2009-03-03 04:11:17 UTC
(In reply to comment #24)
> Btw, I have another hypothesis why your original queries to ISP may be failing.
> You have 2 Gbps links from your host - but do you have gigabit connectivity
> all the way to your service provider? If your uplink is thin, you should have
> some traffic shaping with reasonable buffer size on your device sitting just
> before a bottleneck link - usually a firewall or a router. If this is some
> dumb device, your queries may have been victims of tail drop in a switch.
> Note that SpamAssassin made 29 queries in rapid succession, all within 22
> milliseconds. A solution in such scenario would be to turn on traffic
> shaper on firewall/router or on your host where SpamAssassin is running.
> But never mind, the solution with a caching name server within your
> own network is much better.

FOR THE RECORD:

Some Zyxel DSL/firewall devices have a built in "throttling" feature with minimal settings which cannot be disabled and will drop high UDP traffic to the outside.

Also seen on some other consumer type CableTV modems and cheapo hardware.


a burst of simultaneous DNS queries will trigger the block/drop, erratically and its not easy to debug.


Comment 26 Elsa Andrés 2009-03-03 05:38:39 UTC
(In reply to comment #24)

> The log looks perfect, you got DNS responses to all of your 25 DNS queries,
> all of them within 0.8 seconds of sending a query.

Alright then, sirs. Closing this as "invalid" and sorry for the noise. Never seen this before :-)

> This is a consequence of two things:

> - in your /etc/resolv.conf you are telling your resolver to try appending
>   one or more fields to the domain name if the attempt with original
>   name is unsuccessful. Get rid of your existing 'domain' and 'search'
>   directives in that file, and supply just a 'search' with no arguments.
>   Or at least do not specify something as vague as 'net' for your domain!
>   (according to your tcpdump in #12), but use your exact domain name in a
>   'search' option, and do not use a 'domain' option.

Indeed, that seems to be the culprit. After commented out that sentence ("search net") it returned the expected response (nxdomain).

> - the 'org.net' domain you have queried (appended .net to your ...njabl.org)
>   is registered by Cyberfusion / OKDIRECT.COM.  Such domains attempt to
>   fish for traffic from innocent users which make a typo when enetering an
>   URL, or sites like yours, which append '.net' to unsuccessful queries.
>   Gratuitous advertising. Don't let them have your mistyped queries!

> Btw, I have another hypothesis why your original queries to ISP may be failing.
> You have 2 Gbps links from your host - but do you have gigabit connectivity
> all the way to your service provider? If your uplink is thin, you should have
> some traffic shaping with reasonable buffer size on your device sitting just
> before a bottleneck link - usually a firewall or a router. If this is some
> dumb device, your queries may have been victims of tail drop in a switch.
> Note that SpamAssassin made 29 queries in rapid succession, all within 22
> milliseconds. A solution in such scenario would be to turn on traffic
> shaper on firewall/router or on your host where SpamAssassin is running.
> But never mind, the solution with a caching name server within your
> own network is much better.

It makes sense. 

I had in mind to perform a "silly" test I was thinking on these past days: to connect the host with a dial-up modem and see what happens, just to eliminate 2 things, a) dsl router and b) dsl line themselves. This server is running over static ip address so I am binded to it. By using a dial-up modem (just for testing purposes) I would get interesting results, in a way or another.

I don't mind to have a "named" service on this machine if just SA (or any other service) need it (indeed, bind9 is a good partner), just wanted to be sure where the problem was.

Thanks to all of you for helping me to debug this.
Comment 27 Sidney Markowitz 2009-03-03 07:53:53 UTC
(in reply to comment #26)
> Closing this as "invalid" and sorry for the noise

No problem, and in fact thanks for sticking with it this long. i'm very happy to have documented in the record these difficult to debug problems that could arise from a high speed connection throttled by a lower speed upstream, or by high rate UDP getting throttled by a Zyxel or similarly performing router.
Comment 28 Mark Martinec 2009-03-03 16:43:36 UTC
> No problem, and in fact thanks for sticking with it this long. i'm very happy
> to have documented in the record these difficult to debug problems that could
> arise from a high speed connection throttled by a lower speed upstream, or by
> high rate UDP getting throttled by a Zyxel or similarly performing router.

I wonder how common this problem is. Thanks to Elsa for perseverance!
How many SOHO users of SpamAssassin suffer from this without knowing?

It might be possible to implement application-level traffic shaping
(leaky bucket or token bucket algorithm) for taming UDP storms.
The main problem seems to be lack of fine granularity in tasks of SA
(and the absence of fine granularity interrupts in Perl). Something
like a 10 ms scheduling event would do for the purpose, and could also
help with more prompt harvesting of responses (Bug 5589 comment 7).
A Bug 6060 calls for shorter chunks of noninterrupted grinding too.
Just thinking aloud...
Comment 29 AXB 2009-03-03 22:24:55 UTC
(In reply to comment #28)
> > No problem, and in fact thanks for sticking with it this long. i'm very happy
> > to have documented in the record these difficult to debug problems that could
> > arise from a high speed connection throttled by a lower speed upstream, or by
> > high rate UDP getting throttled by a Zyxel or similarly performing router.
> 
> I wonder how common this problem is. Thanks to Elsa for perseverance!
> How many SOHO users of SpamAssassin suffer from this without knowing?

Please note :

This has been noticed/debugged in a SOHO setup with unusually high traffic
(+80 simultaneous SMTP sessions) doing inline scanning and not using sequential lookups as would happen with a MDA. (specifically: spam trap boxes)

Hardware was +5 year old and not ready for today's high speed connections.
Updated hardware does not show this effect.
In one case, where a custoemr was affected, reporting to Zyxel, resulted in a firmware update which corrected the issue.

> It might be possible to implement application-level traffic shaping
> (leaky bucket or token bucket algorithm) for taming UDP storms.
> The main problem seems to be lack of fine granularity in tasks of SA
> (and the absence of fine granularity interrupts in Perl). Something
> like a 10 ms scheduling event would do for the purpose, and could also
> help with more prompt harvesting of responses (Bug 5589 comment 7).
> A Bug 6060 calls for shorter chunks of noninterrupted grinding too.
> Just thinking aloud...

IMO, adding throttling won't fix the issue as timeframes may vary in such a wide way that it would only make debugging harder.