I have made major improvements to the code and configuration files. The contribution is no longer just a plugin: it now consists of a package, one large XML file, and an indexing/scoring plugin (which is disabled by default). The list of recognized suffixes is no longer limited to top-level domains; second- or third-level public domain suffixes can be recognized as well. The patch also changes the naming from "top level domains" to "domain suffixes".
This patch also introduces a URLUtil class, which includes methods for getting the domain name or the public domain suffix of a URL. Finding the domain name of a URL is quite important for several reasons. First, we can use this function as a replacement for URL.getHost() in LinkDB when ignoring internal links, or in similar contexts. Second, we can perform statistical analysis on domain names. Third, we can list the subdomains under a domain, and so on.
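As a rough illustration of the idea (not the code in the patch), the lookup can be sketched as a longest-suffix match against a set of known suffixes. The class name DomainNameSketch, its method names, and the four hard-coded suffixes are assumptions made for this example only; the patch itself reads its suffixes from domain-suffixes.xml. The sketch parses the URL with java.net.URI rather than java.net.URL.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DomainNameSketch {

  // Hypothetical stand-in for the entries of domain-suffixes.xml.
  private static final Set<String> SUFFIXES =
      new HashSet<>(Arrays.asList("org", "com", "uk", "co.uk"));

  /** Returns the longest known suffix of the host, or null if none matches. */
  public static String getDomainSuffix(String host) {
    String candidate = host.toLowerCase();
    while (true) {
      if (SUFFIXES.contains(candidate)) {
        return candidate;
      }
      int dot = candidate.indexOf('.');
      if (dot < 0) {
        return null; // no labels left to strip
      }
      // Drop the leftmost label and try the shorter suffix.
      candidate = candidate.substring(dot + 1);
    }
  }

  /** Returns the label directly before the public suffix, plus the suffix. */
  public static String getDomainName(String url) {
    String host = URI.create(url).getHost().toLowerCase();
    String suffix = getDomainSuffix(host);
    if (suffix == null || suffix.equals(host)) {
      return host; // unknown suffix, or the host is itself a suffix
    }
    // Everything before ".suffix", e.g. "lucene.apache" for "lucene.apache.org".
    String prefix = host.substring(0, host.length() - suffix.length() - 1);
    String label = prefix.substring(prefix.lastIndexOf('.') + 1);
    return label + "." + suffix;
  }

  public static void main(String[] args) {
    System.out.println(getDomainName("http://lucene.apache.org/nutch")); // apache.org
    System.out.println(getDomainName("http://www.example.co.uk/"));      // example.co.uk
  }
}
```

With such a helper, two links could be treated as internal when their domain names are equal, rather than when their hosts are equal.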
I have changed build.encoding to UTF-8 so that non-ASCII characters are recognized.
Here is an excerpt from the domain-suffixes.xml file:
This document contains top level domains
as described by the Internet Assigned Numbers
Authority (IANA), and second or third level domains that
are known to be managed by domain registrars. People in the
Mozilla community call these public suffixes or effective
tlds. There is no algorithmic way of knowing whether a suffix
is a public domain suffix, or not. So this large file is used
for this purpose. The entries in the file are used to find the
domain of a url, which may not be the same thing as the host of
the url. For example for "http://lucene.apache.org/nutch" the
hostname is lucene.apache.org, however the domain name for this
url would be apache.org. Domain names can be quite handy for
statistical analysis, and fighting against spam.
The list of TLDs is constructed from IANA, and the
list of "effective tlds" is constructed from Wikipedia,
http://wiki.mozilla.org/TLD_List, and http://publicsuffix.org/
The list may not include all the suffixes, but some
effort has been spent to make it comprehensive. Please forward
any improvements for this list to nutch-dev mailing list, or