Bug 6782 - TLD update handling / RegistrarBoundaries need improving
Summary: TLD update handling / RegistrarBoundaries need improving
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: All All
: P1 blocker
Target Milestone: 3.4.1
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-04-01 12:41 UTC by Henrik Krohns
Modified: 2015-04-09 17:41 UTC (History)
3 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
Complete diff vs current trunk patch None Henrik Krohns [HasCLA]
All changed files for easy extraction/testing application/octet-stream None Henrik Krohns [HasCLA]
The cf that we might put into sa-update, for easy viewing here.. text/plain None Henrik Krohns [HasCLA]
All changed files for easy extraction/testing application/gzip None Henrik Krohns [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Henrik Krohns 2012-04-01 12:41:54 UTC
We have config options util_rb_tld, 2tld etc. What these do is add values to Mail::SpamAssassin::Util::RegistrarBoundaries::VALID_TLDS etc from Conf.pm.

Problem is that only some very internal SA functions can make use of the updates. I couldn't figure out how this exactly works. Atleast HTML parser etc see them, since URIBL uses those results and util_rb_tld does query new domains/TLDs using it (for uris with scheme).

Anything that uses the static $VALID_TLDS_RE will not see updates. This includes schemeless uri parsing (so URIBL does not work for those) and FreeMail.

Anything that uses Mail::SpamAssassin::Util::RegistrarBoundaries "directly" will not see those updates. Some suspect code:

./Plugin/URIDNSBL.pm:        $nsrhblstr = Mail::SpamAssassin::Util::RegistrarBoundaries::trim_domain($nsmatch);
./Plugin/HTTPSMismatch.pm:        $uri = Mail::SpamAssassin::Util::RegistrarBoundaries::trim_domain($uri);
./Plugin/HTTPSMismatch.pm:          undef $uri unless (Mail::SpamAssassin::Util::RegistrarBoundaries::is_domain_valid($uri));
./Plugin/HTTPSMismatch.pm:          $https = Mail::SpamAssassin::Util::RegistrarBoundaries::trim_domain($https);
./Plugin/HTTPSMismatch.pm:            undef $https unless (Mail::SpamAssassin::Util::RegistrarBoundaries::is_domain_valid($https));
./Plugin/WLBLEval.pm:    Mail::SpamAssassin::Util::RegistrarBoundaries::split_domain($addr_domain);
./Plugin/WLBLEval.pm:      Mail::SpamAssassin::Util::RegistrarBoundaries::split_domain($relay_rdns);
./Plugin/HeaderEval.pm:    $dom = Mail::SpamAssassin::Util::RegistrarBoundaries::trim_domain($dom);
./Plugin/HeaderEval.pm:        (Mail::SpamAssassin::Util::RegistrarBoundaries::is_domain_valid($dom));
./Util.pm:    $uri = Mail::SpamAssassin::Util::RegistrarBoundaries::trim_domain($uri);
./Util.pm:        (Mail::SpamAssassin::Util::RegistrarBoundaries::is_domain_valid($uri));

Some serious reorganization is needed if we want to properly support TLD list updates via sa-update.
Comment 1 Joe Quinn 2014-03-10 12:19:55 UTC
Cross-referencing discussion from Bug 6926:

> D. Stussy 2013-04-25 01:01:14 UTC
> Periodically walking the root zone's NSEC-RR list may be the best way to regenerate the TLD part of the list, especially as we only care about currently resolvable TLDs.  From there, the automated process would may any necessary modifications (would these all be additions?).

> Adam Katz 2013-04-11 01:13:39 UTC 
> While on the ~tld topic, I see we don't yet include https://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1 (for 2tld and 3tld).  I haven't vetted that to see if it's worthwhile, but in doing some research a while back, it looked ideal.

In addition to the Mozilla DB, IANA has a list of TLDs in similar format here:
http://data.iana.org/TLD/tlds-alpha-by-domain.txt

While I am in favor of removing $VALID_TLDS_RE, there are users who depend on it existing, so I don't think pushing TLDs through sa-update will be a good way to go.
Comment 2 Henrik Krohns 2014-09-17 22:55:18 UTC
Ok I steamed all evening to come up with something. I'll upload the changes/diffs here before committing a large blob like this.. reviews, suggestions, votes welcomed..

- Let's deprecate Util/RegistrarBoundaries.pm completely, but keep it bundled for now because of third party plugins etc. Maybe delete all the leftovers on 3.5 release.

- Implement Mail::SpamAssassin::RegistryBoundaries (now is a good time to "fix" the name :-D), basically the same code just copied

- Only way I managed to make the tld lists work dynamically everywhere is load RegistryBoundaries in the root SpamAssassin.pm (even before plugins are initialized, so plugins can se all the tlds there)

- Basically replaced Mail::SpamAssassin::Util::RegistrarBoundaries::xxxx_domain() calls everywhere with $self->{main}->{registryboundaries}->xxxx_domain()

- Conf.pm (temporarily) includes hardcoded list of domains, so things still work before sa-update is done to get the updated 20_aux_tlds.cf, which now includes all TLDs (new function clear_util_rb makes sure hardcoded values are ignored)

- Mail::SpamAssassin::Util::uri_to_domain is deprecated and moved to RegistryBoundaries since it needs the dynamic $self variables

Everything should work without breaking anything on any version.. maybe I missed some scenario so check it out. :-)
Comment 3 Henrik Krohns 2014-09-17 22:56:15 UTC
Created attachment 5235 [details]
Complete diff vs current trunk
Comment 4 Henrik Krohns 2014-09-17 22:58:00 UTC
Created attachment 5236 [details]
All changed files for easy extraction/testing

lib/Mail/SpamAssassin/Conf.pm
lib/Mail/SpamAssassin/PerMsgStatus.pm
lib/Mail/SpamAssassin/Plugin/FreeMail.pm
lib/Mail/SpamAssassin/Plugin/HTTPSMismatch.pm
lib/Mail/SpamAssassin/Plugin/HeaderEval.pm
lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm
lib/Mail/SpamAssassin/Plugin/URIEval.pm
lib/Mail/SpamAssassin/Plugin/WLBLEval.pm
lib/Mail/SpamAssassin/Util/RegistrarBoundaries.pm
lib/Mail/SpamAssassin/Util.pm
lib/Mail/SpamAssassin.pm
rules/20_aux_tlds.cf
t/ip_addrs.t
t/uri.t

(extract these over svn and make test; make install)
Comment 5 Henrik Krohns 2014-09-17 23:00:00 UTC
Created attachment 5237 [details]
The cf that we might put into sa-update, for easy viewing here..
Comment 6 Kevin A. McGrail 2014-09-18 02:15:17 UTC
I was working on this as well and my thoughts were to deliver the data to then process using new dependencies limiting the changes to RegistrarBoundaries.pm.  I need to look but your change seems far more intrusive.
Comment 7 Henrik Krohns 2014-09-18 04:29:09 UTC
Feel free to describe what you had in mind? Since Util-stuff do not have access by design to any configuration or state data, I have a hard time imagining how this would effectively work..
Comment 8 Henrik Krohns 2014-09-20 05:33:31 UTC
Created attachment 5238 [details]
All changed files for easy extraction/testing

Oops, last package was missing the obvious lib/Mail/SpamAssassin/RegistryBoundaries.pm ... also cp rules/20_aux_tlds.cf manually in place after sa-update.
Comment 9 AXB 2014-09-22 12:08:53 UTC
Running SA/spamd on Centos 6.5 with these patches on a trap box and see no issues.

make test: OK

Freemail.pm: OK
URIBL.pm lookups: ok
white/blacklist_from/etc: OK

due to low traffic I haven't been able to test

whitelist_uri_host
blacklist_uri_host

added some rules for these and hoping for hits
Comment 10 AXB 2014-09-22 12:13:15 UTC
I'd suggest putting CCTlds (util_rb_tld) in separate file which could make it easier to mantain, programatically.
eg 10_cc_tld.cf

comments?
Comment 11 Henrik Krohns 2014-12-06 09:35:44 UTC
Mark, do you have any comments?

This seems to be a blocker and Kevin isn't commenting anything else except "it's intrusive"? How else do you expect to access configuration data?

Obviously needs a bit more honing, documentation and adding some api functions instead of directly reading variables. But is there a better method to go about it?
Comment 12 Kevin A. McGrail 2014-12-08 12:24:02 UTC
(In reply to Henrik Krohns from comment #11)
> Mark, do you have any comments?
> 
> This seems to be a blocker and Kevin isn't commenting anything else except
> "it's intrusive"? How else do you expect to access configuration data?
> 
> Obviously needs a bit more honing, documentation and adding some api
> functions instead of directly reading variables. But is there a better
> method to go about it?

This isn't the last blocker so please don't stress. I will use your code if it comes down to it because an idea that works with hacks is not as good as working code.

I'll see if I can get txrep going as well.
Comment 13 Joe Quinn 2015-04-06 15:09:21 UTC
I added two lines of bash to this .cf file that will semi-automate fetching the latest TLDs and putting them in the top two blocks. I ran them to get an up-to-date list because it's been a while since this ticket was opened.

There shouldn't be any problem with committing just the cf, and everything that's new is guarded behind a can() check.

It should be doubly safe if AXB's masscheck box stays broken, too! ;)

Committed revision 1671546.
Comment 14 AXB 2015-04-06 16:42:04 UTC
(In reply to Joe Quinn from comment #13)
> I added two lines of bash to this .cf file that will semi-automate fetching
> the latest TLDs and putting them in the top two blocks. I ran them to get an
> up-to-date list because it's been a while since this ticket was opened.
> 
> There shouldn't be any problem with committing just the cf, and everything
> that's new is guarded behind a can() check.
> 
> It should be doubly safe if AXB's masscheck box stays broken, too! ;)
> 
> Committed revision 1671546.

btw, Henrik's patch works fine - just needs to be accepted, merged,and cosolidate the .cf stuff...
Comment 15 Kevin A. McGrail 2015-04-06 16:48:03 UTC
Agreed.  We are taking it to the point of actually putting a cf into rules that hopefully will be delivered by sa-update to confirm it works end-to-end to release 3.4.1.
Comment 16 AXB 2015-04-06 16:53:16 UTC
FreeMail.pm includes:

# List of TLDs from RegistrarBoundaries.pm
my $tlds = $Mail::SpamAssassin::Util::RegistrarBoundaries::VALID_TLDS_RE;

This should be looked into - 
include backward compatibility / version dependency
Comment 17 Henrik Krohns 2015-04-06 17:04:35 UTC
FreeMail is patched in all the files I provided..
Comment 18 Kevin A. McGrail 2015-04-06 17:07:29 UTC
With me, AXB and Henrik's defacto, this can be committed.

Joe is working on committing his patch.  Just has some patches that aren't clean because of the length of time since this was done and changes to trunk since.
Comment 19 Joe Quinn 2015-04-06 17:57:32 UTC
Would have replied earlier but resolving the patch consumed my attention.

The cf was a separate commit because it seemed cleaner that way and it made the next commit slightly smaller. The cf goes "back in time" via sa-update so it seemed conceptually separate enough to merit it.

Henrik, your patch is now merged. I took the liberty of adding the magic word "deprecated" to RegistrarBoundaries.pm and I also added two bash lines to semi-automate maintenance of the cf file.

Committed revision 1671621.
Comment 20 AXB 2015-04-06 18:32:20 UTC
KAM,
Could the update of 20_aux_tlds.cf be cron'd to check if there are any changes to http://data.iana.org/TLD/tlds-alpha-by-domain.txt and published via sa-update?
(not depending on availability of autopromoted rules)
Comment 21 Joe Quinn 2015-04-06 18:35:47 UTC
There's two lines of bash included in the cf that you can run to get the current data formatted correctly and split into the appropriate sections, so half the automation is already done.
Comment 22 Kevin A. McGrail 2015-04-06 18:49:25 UTC
(In reply to AXB from comment #20)
> KAM,
> Could the update of 20_aux_tlds.cf be cron'd to check if there are any
> changes to http://data.iana.org/TLD/tlds-alpha-by-domain.txt and published
> via sa-update?
> (not depending on availability of autopromoted rules)

Yes, it could be automated but my goal right now is requirements to release 3.4.1

The cf is currently delivered as part of the standard rules channel so it depends on a published rule set.

Can you open a separate bug to see if we can automate the TLD updates?  That can be done after 3.4.1 is rolled out.
Comment 23 Kevin A. McGrail 2015-04-06 19:54:19 UTC
Bug 7165 open re: Updating TLDs via Cron.
Comment 24 Henrik Krohns 2015-04-08 12:38:15 UTC
Fixed bug detected from dnsbl.t

Sending        PerMsgStatus.pm
Transmitting file data .
Committed revision 1672078.
Comment 25 Joe Quinn 2015-04-08 20:05:22 UTC
As a side note on Henrik's patch by the way, even though it notes the trie optimization in 5.10.1, our testing on 5.8.8 completed faster than our newer perl versions, on the same machine. A good data point to note, since 5.8 is still supported.
Comment 26 Kevin A. McGrail 2015-04-08 20:19:36 UTC
Considering resolved.
Comment 27 Michael Parker 2015-04-08 22:36:34 UTC
It ran faster on 5.8 because regex support changed in the 5.10 timeframe and generally ended up slower.
Comment 28 AXB 2015-04-09 16:59:55 UTC
Any special reason the new RegistryBoundaries.pm is not in /lib/Mail/SpamAssassin/Util ?
Comment 29 Mark Martinec 2015-04-09 17:41:14 UTC
> Any special reason the new RegistryBoundaries.pm is not in
> /lib/Mail/SpamAssassin/Util ?

They need their own namespace so that an object can be created (->new)
holding the environment for its methods.