SA Bugzilla – Bug 6782
TLD update handling / RegistrarBoundaries need improving
Last modified: 2015-04-09 17:41:14 UTC
We have config options util_rb_tld, 2tld etc. What these do is add values to Mail::SpamAssassin::Util::RegistrarBoundaries::VALID_TLDS etc from Conf.pm. Problem is that only some very internal SA functions can make use of the updates. I couldn't figure out how this exactly works. Atleast HTML parser etc see them, since URIBL uses those results and util_rb_tld does query new domains/TLDs using it (for uris with scheme). Anything that uses the static $VALID_TLDS_RE will not see updates. This includes schemeless uri parsing (so URIBL does not work for those) and FreeMail. Anything that uses Mail::SpamAssassin::Util::RegistrarBoundaries "directly" will not see those updates. Some suspect code: ./Plugin/URIDNSBL.pm: $nsrhblstr = Mail::SpamAssassin::Util::RegistrarBoundaries::trim_domain($nsmatch); ./Plugin/HTTPSMismatch.pm: $uri = Mail::SpamAssassin::Util::RegistrarBoundaries::trim_domain($uri); ./Plugin/HTTPSMismatch.pm: undef $uri unless (Mail::SpamAssassin::Util::RegistrarBoundaries::is_domain_valid($uri)); ./Plugin/HTTPSMismatch.pm: $https = Mail::SpamAssassin::Util::RegistrarBoundaries::trim_domain($https); ./Plugin/HTTPSMismatch.pm: undef $https unless (Mail::SpamAssassin::Util::RegistrarBoundaries::is_domain_valid($https)); ./Plugin/WLBLEval.pm: Mail::SpamAssassin::Util::RegistrarBoundaries::split_domain($addr_domain); ./Plugin/WLBLEval.pm: Mail::SpamAssassin::Util::RegistrarBoundaries::split_domain($relay_rdns); ./Plugin/HeaderEval.pm: $dom = Mail::SpamAssassin::Util::RegistrarBoundaries::trim_domain($dom); ./Plugin/HeaderEval.pm: (Mail::SpamAssassin::Util::RegistrarBoundaries::is_domain_valid($dom)); ./Util.pm: $uri = Mail::SpamAssassin::Util::RegistrarBoundaries::trim_domain($uri); ./Util.pm: (Mail::SpamAssassin::Util::RegistrarBoundaries::is_domain_valid($uri)); Some serious reorganization is needed if we want to properly support TLD list updates via sa-update.
Cross-referencing discussion from Bug 6926: > D. Stussy 2013-04-25 01:01:14 UTC > Periodically walking the root zone's NSEC-RR list may be the best way to regenerate the TLD part of the list, especially as we only care about currently resolvable TLDs. From there, the automated process would may any necessary modifications (would these all be additions?). > Adam Katz 2013-04-11 01:13:39 UTC > While on the ~tld topic, I see we don't yet include https://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1 (for 2tld and 3tld). I haven't vetted that to see if it's worthwhile, but in doing some research a while back, it looked ideal. In addition to the Mozilla DB, IANA has a list of TLDs in similar format here: http://data.iana.org/TLD/tlds-alpha-by-domain.txt While I am in favor of removing $VALID_TLDS_RE, there are users who depend on it existing, so I don't think pushing TLDs through sa-update will be a good way to go.
Ok I steamed all evening to come up with something. I'll upload the changes/diffs here before committing a large blob like this.. reviews, suggestions, votes welcomed.. - Let's deprecate Util/RegistrarBoundaries.pm completely, but keep it bundled for now because of third party plugins etc. Maybe delete all the leftovers on 3.5 release. - Implement Mail::SpamAssassin::RegistryBoundaries (now is a good time to "fix" the name :-D), basically the same code just copied - Only way I managed to make the tld lists work dynamically everywhere is load RegistryBoundaries in the root SpamAssassin.pm (even before plugins are initialized, so plugins can se all the tlds there) - Basically replaced Mail::SpamAssassin::Util::RegistrarBoundaries::xxxx_domain() calls everywhere with $self->{main}->{registryboundaries}->xxxx_domain() - Conf.pm (temporarily) includes hardcoded list of domains, so things still work before sa-update is done to get the updated 20_aux_tlds.cf, which now includes all TLDs (new function clear_util_rb makes sure hardcoded values are ignored) - Mail::SpamAssassin::Util::uri_to_domain is deprecated and moved to RegistryBoundaries since it needs the dynamic $self variables Everything should work without breaking anything on any version.. maybe I missed some scenario so check it out. :-)
Created attachment 5235 [details] Complete diff vs current trunk
Created attachment 5236 [details] All changed files for easy extraction/testing lib/Mail/SpamAssassin/Conf.pm lib/Mail/SpamAssassin/PerMsgStatus.pm lib/Mail/SpamAssassin/Plugin/FreeMail.pm lib/Mail/SpamAssassin/Plugin/HTTPSMismatch.pm lib/Mail/SpamAssassin/Plugin/HeaderEval.pm lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm lib/Mail/SpamAssassin/Plugin/URIEval.pm lib/Mail/SpamAssassin/Plugin/WLBLEval.pm lib/Mail/SpamAssassin/Util/RegistrarBoundaries.pm lib/Mail/SpamAssassin/Util.pm lib/Mail/SpamAssassin.pm rules/20_aux_tlds.cf t/ip_addrs.t t/uri.t (extract these over svn and make test; make install)
Created attachment 5237 [details] The cf that we might put into sa-update, for easy viewing here..
I was working on this as well and my thoughts were to deliver the data to then process using new dependencies limiting the changes to RegistrarBoundaries.pm. I need to look but your change seems far more intrusive.
Feel free to describe what you had in mind? Since Util-stuff do not have access by design to any configuration or state data, I have a hard time imagining how this would effectively work..
Created attachment 5238 [details] All changed files for easy extraction/testing Oops, last package was missing the obvious lib/Mail/SpamAssassin/RegistryBoundaries.pm ... also cp rules/20_aux_tlds.cf manually in place after sa-update.
Running SA/spamd on Centos 6.5 with these patches on a trap box and see no issues. make test: OK Freemail.pm: OK URIBL.pm lookups: ok white/blacklist_from/etc: OK due to low traffic I haven't been able to test whitelist_uri_host blacklist_uri_host added some rules for these and hoping for hits
I'd suggest putting CCTlds (util_rb_tld) in separate file which could make it easier to mantain, programatically. eg 10_cc_tld.cf comments?
Mark, do you have any comments? This seems to be a blocker and Kevin isn't commenting anything else except "it's intrusive"? How else do you expect to access configuration data? Obviously needs a bit more honing, documentation and adding some api functions instead of directly reading variables. But is there a better method to go about it?
(In reply to Henrik Krohns from comment #11) > Mark, do you have any comments? > > This seems to be a blocker and Kevin isn't commenting anything else except > "it's intrusive"? How else do you expect to access configuration data? > > Obviously needs a bit more honing, documentation and adding some api > functions instead of directly reading variables. But is there a better > method to go about it? This isn't the last blocker so please don't stress. I will use your code if it comes down to it because an idea that works with hacks is not as good as working code. I'll see if I can get txrep going as well.
I added two lines of bash to this .cf file that will semi-automate fetching the latest TLDs and putting them in the top two blocks. I ran them to get an up-to-date list because it's been a while since this ticket was opened. There shouldn't be any problem with committing just the cf, and everything that's new is guarded behind a can() check. It should be doubly safe if AXB's masscheck box stays broken, too! ;) Committed revision 1671546.
(In reply to Joe Quinn from comment #13) > I added two lines of bash to this .cf file that will semi-automate fetching > the latest TLDs and putting them in the top two blocks. I ran them to get an > up-to-date list because it's been a while since this ticket was opened. > > There shouldn't be any problem with committing just the cf, and everything > that's new is guarded behind a can() check. > > It should be doubly safe if AXB's masscheck box stays broken, too! ;) > > Committed revision 1671546. btw, Henrik's patch works fine - just needs to be accepted, merged,and cosolidate the .cf stuff...
Agreed. We are taking it to the point of actually putting a cf into rules that hopefully will be delivered by sa-update to confirm it works end-to-end to release 3.4.1.
FreeMail.pm includes: # List of TLDs from RegistrarBoundaries.pm my $tlds = $Mail::SpamAssassin::Util::RegistrarBoundaries::VALID_TLDS_RE; This should be looked into - include backward compatibility / version dependency
FreeMail is patched in all the files I provided..
With me, AXB and Henrik's defacto, this can be committed. Joe is working on committing his patch. Just has some patches that aren't clean because of the length of time since this was done and changes to trunk since.
Would have replied earlier but resolving the patch consumed my attention. The cf was a separate commit because it seemed cleaner that way and it made the next commit slightly smaller. The cf goes "back in time" via sa-update so it seemed conceptually separate enough to merit it. Henrik, your patch is now merged. I took the liberty of adding the magic word "deprecated" to RegistrarBoundaries.pm and I also added two bash lines to semi-automate maintenance of the cf file. Committed revision 1671621.
KAM, Could the update of 20_aux_tlds.cf be cron'd to check if there are any changes to http://data.iana.org/TLD/tlds-alpha-by-domain.txt and published via sa-update? (not depending on availability of autopromoted rules)
There's two lines of bash included in the cf that you can run to get the current data formatted correctly and split into the appropriate sections, so half the automation is already done.
(In reply to AXB from comment #20) > KAM, > Could the update of 20_aux_tlds.cf be cron'd to check if there are any > changes to http://data.iana.org/TLD/tlds-alpha-by-domain.txt and published > via sa-update? > (not depending on availability of autopromoted rules) Yes, it could be automated but my goal right now is requirements to release 3.4.1 The cf is currently delivered as part of the standard rules channel so it depends on a published rule set. Can you open a separate bug to see if we can automate the TLD updates? That can be done after 3.4.1 is rolled out.
Bug 7165 open re: Updating TLDs via Cron.
Fixed bug detected from dnsbl.t Sending PerMsgStatus.pm Transmitting file data . Committed revision 1672078.
As a side note on Henrik's patch by the way, even though it notes the trie optimization in 5.10.1, our testing on 5.8.8 completed faster than our newer perl versions, on the same machine. A good data point to note, since 5.8 is still supported.
Considering resolved.
It ran faster on 5.8 because regex support changed in the 5.10 timeframe and generally ended up slower.
Any special reason the new RegistryBoundaries.pm is not in /lib/Mail/SpamAssassin/Util ?
> Any special reason the new RegistryBoundaries.pm is not in > /lib/Mail/SpamAssassin/Util ? They need their own namespace so that an object can be created (->new) holding the environment for its methods.