Hi Tejas - you're right about (1), it should indeed be host_a.example.org, host_b.example.org ==> example.org but not x.xyz.org, a.abc.org ==> unknown. The reducer should take the domain + suffix as key and emit the domain only if ALL of its hosts are unknown. If you emit a domain when most but not all of its hosts are unknown, the DomainBlacklistURLFilter will remove the entire domain, including its known hosts, from the CrawlDB and WebgraphDB.
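To make the "ALL hosts" rule concrete, here is a minimal sketch in plain Python rather than actual reducer code. The function names and the input format (pairs of host and an is-unknown flag) are made up for illustration, and taking the last two labels as domain + suffix is a simplifying assumption that a real implementation would replace with a proper public-suffix lookup.

```python
from collections import defaultdict


def registered_domain(host):
    # Simplifying assumption: domain + suffix is the last two labels.
    # Real code would need a public-suffix lookup instead.
    return ".".join(host.split(".")[-2:])


def domains_to_blacklist(host_status):
    """host_status: iterable of (host, is_unknown) pairs (hypothetical format)."""
    grouped = defaultdict(list)
    for host, is_unknown in host_status:
        # Key by domain + suffix, as the reducer would.
        grouped[registered_domain(host)].append(is_unknown)
    # Emit a domain only if every one of its hosts is unknown.
    return sorted(d for d, flags in grouped.items() if all(flags))


records = [
    ("host_a.example.org", True),
    ("host_b.example.org", True),
    ("a.abc.org", False),  # one known host keeps abc.org out of the list
    ("b.abc.org", True),
]
blacklist = domains_to_blacklist(records)
```

With these records only example.org is emitted; abc.org is kept because one of its hosts is still known.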
The example for (2) does not include cross-domain redirects, but the problem is similar. I think it works fine for now because multi-redirects are not very common on the internet as a whole.
A larger problem is the filterNormalize() method. It actually receives a hostname, not a URL, so to pass the URL filters we must prepend a URL scheme to make it look like a URL. I use the http:// scheme, but not all hosts may allow that scheme. We have a modified domain filter that optionally takes a scheme so we can force HTTPS for specific domains. Hosts on those domains are filtered out because the prepended http:// scheme is not allowed for them.
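A rough illustration of that scheme problem, again as a plain Python sketch and not the real filter code. The rule table, the domain in it, and both function names are hypothetical; the only point is that a bare hostname with http:// prepended gets rejected by a filter that forces HTTPS for its domain.

```python
from urllib.parse import urlsplit

# Hypothetical rule set: domain -> the only scheme the filter accepts for it.
FORCED_SCHEMES = {"secure.example": "https"}


def domain_filter(url, forced=FORCED_SCHEMES):
    """Return the URL if it passes, or None if it is filtered out."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    for domain, scheme in forced.items():
        if host == domain or host.endswith("." + domain):
            # The domain has a forced scheme: reject any other scheme.
            return url if parts.scheme == scheme else None
    return url


def filter_host(host, scheme="http"):
    # A bare hostname has to be dressed up as a URL before filtering.
    return domain_filter(f"{scheme}://{host}/")


rejected = filter_host("www.secure.example")           # http:// prepended
accepted = filter_host("www.secure.example", "https")  # scheme matches the rule
```

Here `rejected` is None even though the host itself is fine, which is the situation described above; passing the matching scheme lets the same host through.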
I think I've got a slightly newer version of the tools, but I don't know what actually changed in the past year. I'll try to diff and upload it.