SA Bugzilla – Bug 6926
cw, sx TLDs (and possibly others) not recognized
Last modified: 2014-03-10 12:20:57 UTC
http://www.iana.org/domains/root/db lists a couple ccTLDs that are not recognized by SpamAssassin. The two that occurred to me are cw and sx. I think they should be added to RegistrarBoundaries.pm. Perhaps a comprehensive sync with IANA is in order? Do they provide their TLD list in machine readable format?
> Do they provide their TLD list in machine readable format?

How about:

$ wget -qqO- https://www.iana.org/domains/root/db \
  |perl -ne 'if (m"/root/db/([^.]+)\.html") { print "$1\n" }' \
  > tld.txt

After that, you can run:

$ sed '/^ ac ad/,/^ zm zw/!d; s/^ //; s/  */\n/g' \
  lib/Mail/SpamAssassin/Util/RegistrarBoundaries.pm \
  |grep -vwFf- tld.txt

Which currently reveals we're missing:

bl bq bv cw eh gb mf post sj ss sx um

(plus all the punycode IDNs, unless we track them elsewhere)

(I also ran the opposite. We don't have any TLDs that aren't on IANA's list.)

We'll have to add these via util_rb_tld in sa-update in addition to RegistrarBoundaries.pm so users don't have to wait for SA 3.4.0 to get this.

While on the ~tld topic, I see we don't yet include https://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1 (for 2tld and 3tld). I haven't vetted that list to see if it's worthwhile, but when I did some research a while back, it looked ideal.
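For anyone who would rather not chain wget, perl, sed and grep, the same comparison can be sketched in Python. This is only an illustration: the regex is the one from the perl one-liner above, while the function names and the sample data are made up for the example, and nothing here reads RegistrarBoundaries.pm or fetches the IANA page.

```python
import re

# Same pattern as the perl one-liner: capture the label in /root/db/<tld>.html
TLD_LINK = re.compile(r"/root/db/([^.]+)\.html")


def extract_tlds(html):
    """Return the set of TLD labels linked from an IANA root-db style page."""
    return set(TLD_LINK.findall(html))


def missing_tlds(iana_tlds, known_tlds):
    """Return the TLDs that IANA lists but the local list lacks, sorted."""
    return sorted(set(iana_tlds) - set(known_tlds))


if __name__ == "__main__":
    # Hypothetical sample input; a real run would use the downloaded page
    # and the TLD list extracted from RegistrarBoundaries.pm.
    sample_html = (
        '<a href="/domains/root/db/com.html">.com</a> '
        '<a href="/domains/root/db/cw.html">.cw</a> '
        '<a href="/domains/root/db/sx.html">.sx</a>'
    )
    known = "com net org".split()
    print(missing_tlds(extract_tlds(sample_html), known))
```

Swapping the arguments to missing_tlds gives the opposite check (TLDs we carry that IANA does not list).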
(In reply to comment #1)
> Which currently reveals we're missing:
>
> bl bq bv cw eh gb mf post sj ss sx um

Ah, now I see the comment in RegistrarBoundaries.pm:

# The following have been removed from the list because they are
# inactive, as can be seen in the Wikipedia articles about them
# as of 2008-02-08, e.g. http://en.wikipedia.org/wiki/.so_%28domain_name%29
# bv gb pm sj so um yt

A quick summary of all the candidates:

bl - Saint Barthélemy (French), reserved but unassigned
bq - Caribbean Netherlands, designated but not yet used
bv - Bouvet Island (Norway), reserved and sponsored but unused
cw - Curaçao, new and in use, http://www.una.cw/cw_registry/
eh - Western Sahara, no recognized government, reserved but not in use
gb - Great Britain, abandoned except ~3 hosts under .dra.hmg.gb
mf - Saint Martin (French), reserved but unassigned
post - Universal Postal Union (snail mail), new as of August
sj - Svalbard and Jan Mayen (Norway), reserved and sponsored but unused
ss - South Sudan, new and registered, still pending (nation formed 2011-07)
sx - Sint Maarten (Netherlands), new, in use (open), http://registry.sx/
um - US Minor Outlying Islands (USA), revoked (why is it listed?)

It looks like it's still valid to avoid bv gb sj and um. We should add cw and post and sx. The others may be on their way, but don't need inclusion quite yet.

Given how .sx has open registration and looks like "sex," I expect it to attract porn sites, which means spam. We definitely want to include that one.

You can find these on Wikipedia at URLs like https://en.wikipedia.org/wiki/.so (note the dot; also note that the disambiguation page from 2008 is gone) and similarly at IANA at URLs like https://www.iana.org/domains/root/db/so.html
Periodically walking the root zone's NSEC-RR list may be the best way to regenerate the TLD part of the list, especially as we only care about currently resolvable TLDs. From there, the automated process would make any necessary modifications (would these all be additions?).
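A rough sketch of what such a walk could look like, in Python. The DNS lookup itself is deliberately left out: next_name is a hypothetical callable that, given an owner name, returns the "next domain name" field of its NSEC record, and would have to be implemented with whatever resolver library is at hand. The helper names and the tiny chain in the demo are assumptions for illustration only.

```python
def walk_nsec_chain(next_name, start=".", limit=10000):
    """Follow NSEC next-name pointers from start until the chain wraps back.

    next_name is a caller-supplied (hypothetical) lookup: owner name in,
    next owner name out. Returns the names visited, excluding start.
    """
    names = []
    current = next_name(start)
    while current != start:
        if len(names) >= limit:
            raise RuntimeError("NSEC chain did not wrap around")
        names.append(current)
        current = next_name(current)
    return names


def diff_tlds(walked, known):
    """Return (additions, removals) relative to the known TLD list."""
    walked_set = {name.rstrip(".").lower() for name in walked}
    known_set = set(known)
    return sorted(walked_set - known_set), sorted(known_set - walked_set)


if __name__ == "__main__":
    # Made-up three-entry chain standing in for the real root zone.
    chain = {".": "com.", "com.": "cw.", "cw.": "sx.", "sx.": "."}
    walked = walk_nsec_chain(chain.__getitem__)
    print(diff_tlds(walked, ["com", "so"]))
```

Returning removals alongside additions would also answer the question of whether the changes are always additions: any TLD that drops out of the root zone shows up in the second list.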
We are seeing SPOOF_COM2COM and SPOOF_COM2OTH being triggered by domains on .org.pe
(In reply to comment #4)
> We are seeing SPOOF_COM2COM and SPOOF_COM2OTH being triggered by domains on
> .org.pe

Not relevant to this bug; please use the SA users list.
never mind, .pe isn't a new domain. :P
Per comment #2, will anything be done about cw and sx?
Added the two mentioned TLDs with revision 1575917. Higher-level improvements to RegistrarBoundaries.pm likely belong on bug 6782. I've cross-referenced this ticket, so we can continue discussion there.