Bug 6926 - cw, sx TLDs (and possibly others) not recognized
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: SVN Trunk (Latest Devel Version)
Hardware: All All
Importance: P2 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
Reported: 2013-04-09 00:02 UTC by Julian Mehnle
Modified: 2014-03-10 12:20 UTC



Description Julian Mehnle 2013-04-09 00:02:43 UTC
http://www.iana.org/domains/root/db lists a couple of ccTLDs that are not recognized by SpamAssassin.  The two that occurred to me are cw and sx.  I think they should be added to RegistrarBoundaries.pm.  Perhaps a comprehensive sync with IANA is in order?  Do they provide their TLD list in machine readable format?
Comment 1 Adam Katz 2013-04-11 01:13:39 UTC
> Do they provide their TLD list in machine readable format?

How about:

    $ wget -qqO- https://www.iana.org/domains/root/db \
      |perl -ne 'if (m"/root/db/([^.]+)\.html") { print "$1\n" }' \
      > tld.txt

After that, you can run:

    $ sed '/^  ac ad/,/^  zm zw/!d; s/^  //; s/  */\n/g' \
      lib/Mail/SpamAssassin/Util/RegistrarBoundaries.pm \
        |grep -vwFf- tld.txt

Which currently reveals we're missing:

    bl bq bv cw eh gb mf post sj ss sx um
    (plus all the punycode IDNs, unless we track them elsewhere)

(I also ran the opposite.  We don't have any TLDs that aren't on IANA's list.)

We'll have to add these via util_rb_tld in sa-update in addition to RegistrarBoundaries.pm so users don't have to wait for SA 3.4.0 to get this.

While on the ~tld topic, I see we don't yet include https://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1 (for 2tld and 3tld).  I haven't vetted that to see if it's worthwhile, but in doing some research a while back, it looked ideal.
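
The two-way comparison described in this comment can be sketched with sort and comm, as an alternative to the sed/grep pipeline above. This is only a sketch under assumptions: the function names iana_tlds and tld_diff are hypothetical, both inputs are assumed to be one TLD per line, and the example URL points at IANA's plain-text list (one uppercase TLD per line after a leading "#" comment line) rather than at the HTML page scraped above.

```shell
# Hypothetical helper: normalize an IANA-style plain-text TLD list on stdin.
# Drops "#" comment lines and lowercases everything else.
iana_tlds() {
    grep -v '^#' | tr '[:upper:]' '[:lower:]'
}

# Hypothetical helper: compare two one-TLD-per-line files.
#   $1 = TLDs already known (e.g. extracted from RegistrarBoundaries.pm)
#   $2 = TLDs taken from IANA
tld_diff() {
    known=$(mktemp) || return 1
    iana=$(mktemp) || { rm -f "$known"; return 1; }
    # comm needs sorted input; sort -u also removes duplicates.
    LC_ALL=C sort -u "$1" > "$known"
    LC_ALL=C sort -u "$2" > "$iana"
    # Only in the IANA list: candidates to add.
    LC_ALL=C comm -13 "$known" "$iana" | sed 's/^/missing: /'
    # Only in the known list: candidates to remove.
    LC_ALL=C comm -23 "$known" "$iana" | sed 's/^/stale: /'
    rm -f "$known" "$iana"
}

# Example (needs network access; known.txt is an assumed file name):
#   wget -qO- https://data.iana.org/TLD/tlds-alpha-by-domain.txt | iana_tlds > tld.txt
#   tld_diff known.txt tld.txt
```

Prefixing each line with "missing:" or "stale:" keeps both directions of the check in one output, so inactive TLDs that were removed on purpose (see the next comment) still need to be filtered by hand.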
Comment 2 Adam Katz 2013-04-11 02:24:05 UTC
(In reply to comment #1)
> Which currently reveals we're missing:
> 
>     bl bq bv cw eh gb mf post sj ss sx um

Ah, now I see the comment in RegistrarBoundaries.pm:

# The following have been removed from the list because they are
# inactive, as can be seen in the Wikipedia articles about them
# as of 2008-02-08, e.g. http://en.wikipedia.org/wiki/.so_%28domain_name%29
#     bv gb pm sj so um yt

A quick summary of all the candidates:

bl - Saint Barthélemy (French), reserved but unassigned
bq - Caribbean Netherlands, designated but not yet used
bv - Bouvet Island (Norway), reserved and sponsored but unused
cw - Curaçao, new and in use, http://www.una.cw/cw_registry/
eh - Western Sahara, no recognized government, reserved but not in use
gb - Great Britain, abandoned except ~3 hosts under .dra.hmg.gb
mf - Saint Martin (French), reserved but unassigned
post - Universal Postal Union (snail mail), new as of August
sj - Svalbard and Jan Mayen (Norway), reserved and sponsored but unused
ss - South Sudan, new and registered, still pending (nation formed 2011-07)
sx - Sint Maarten (Netherlands), new, in use (open), http://registry.sx/
um - US Minor Outlying Islands (USA), revoked (why is it listed?)

It looks like it's still valid to avoid bv, gb, sj, and um.  We should add cw, post, and sx.  The others may be on their way, but don't need inclusion quite yet.

Given how .sx has open registration and looks like "sex," I expect it to attract porn sites, which means spam.  We definitely want to include that one.

You can find these on Wikipedia at URLs like https://en.wikipedia.org/wiki/.so (note the dot, also note that the disambiguation page from 2008 is gone) and similarly at IANA at URLs like https://www.iana.org/domains/root/db/so.html
Comment 3 D. Stussy 2013-04-25 01:01:14 UTC
Periodically walking the root zone's NSEC-RR list may be the best way to regenerate the TLD part of the list, especially as we only care about currently resolvable TLDs.  From there, the automated process would make any necessary modifications (would these all be additions?).
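
A rough sketch of the NSEC idea, with one substitution: instead of walking the chain with live DNS queries, it reads a copy of the root zone in standard zone-file presentation format (owner, TTL, class, type, rdata) from stdin and collects the owner names of the NSEC records. The function name tlds_from_nsec is hypothetical, and the field layout is an assumption about the input file.

```shell
# Hypothetical helper: list TLDs from the NSEC records of a root zone file.
# Assumes each record is on one line as: owner TTL class type rdata...
tlds_from_nsec() {
    # Keep only NSEC records, skip the zone apex ".", strip the trailing
    # dot from the owner name, and lowercase it.
    awk '$4 == "NSEC" && $1 != "." { sub(/\.$/, "", $1); print tolower($1) }' |
        LC_ALL=C sort -u
}

# Example (needs network access):
#   wget -qO- https://www.internic.net/domain/root.zone | tlds_from_nsec > tld.txt
```

Since every delegated TLD in the signed root zone owns an NSEC record, the output should contain only currently delegated TLDs, which matches the "currently resolvable" criterion above.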
Comment 4 Quanah Gibson-Mount 2013-04-25 22:10:23 UTC
We are seeing SPOOF_COM2COM and SPOOF_COM2OTH being triggered by domains on .org.pe
Comment 5 AXB 2013-04-25 22:13:05 UTC
(In reply to comment #4)
> We are seeing SPOOF_COM2COM and SPOOF_COM2OTH being triggered by domains on
> .org.pe

not relevant to this bug

please use the SA users list.
Comment 6 Quanah Gibson-Mount 2013-04-25 22:17:14 UTC
never mind, .pe isn't a new domain. :P
Comment 7 Julian Mehnle 2014-03-06 17:40:02 UTC
Per comment #2, will anything be done about cw and sx?
Comment 8 Joe Quinn 2014-03-10 12:20:57 UTC
Added the two mentioned TLDs with revision 1575917.

Higher-level improvements to RegistrarBoundaries.pm likely belong on bug 6782. I've cross-referenced this ticket, so we can continue discussion there.