The generate.max.per.host parameter does work, but with the following limitations that we've run into:
1. The current code uses the entire hostname when deciding max links/host. There are a lot of spammy sites out there that have URLs with the form xxxx-somedomain.com, where xxxx is essentially a random number.
We've got code that does a better job of deriving the true "base" domain name, but then there's...
2. Sites that actually have many IP addresses (not sure if they're in a common subnet block or not), where the domain name is xxxx-somedomain.com.
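To make the first limitation concrete, here's a minimal sketch (not the actual Nutch code, and not our patch) of what counting by a derived "base" domain instead of the full hostname looks like. The last-two-labels heuristic is an assumption for illustration; a real implementation would consult a public-suffix list so ccTLDs like .co.kr are handled correctly, and it still does nothing for technique #2, where every spam host is its own registered domain.

```python
from collections import defaultdict
from urllib.parse import urlparse

def base_domain(host):
    # Naive heuristic: keep the last two labels of the hostname, so
    # xxxx.somedomain.com collapses to somedomain.com. A production
    # version should use the public-suffix list instead.
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host

def select_urls(urls, max_per_domain=50):
    # Cap generated links per base domain rather than per full hostname.
    counts = defaultdict(int)
    selected = []
    for url in urls:
        dom = base_domain(urlparse(url).hostname or "")
        if counts[dom] < max_per_domain:
            counts[dom] += 1
            selected.append(url)
    return selected
```

With this grouping, a link farm that spins up random subdomains shares one quota; a farm that registers random hyphenated domains (xxxx-somedomain.com) still gets a fresh bucket per domain, which is exactly the gap described above.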
Because of these two link-farm techniques, we ran into cases where roughly 100K links were effectively fetched from the same spam-laden domain, even with a generate.max.per.host setting of 50, after 40+ crawl loops.
And what's really unfortunate is that many of these sites sit on low-bandwidth hosters in Korea and China, so your crawl speed drops dramatically because you're spending all your time waiting for worthless bytes to arrive.