Description
HostDB for Apache Nutch 1.x
- automatically generates a HostDB based on CrawlDB information
- periodically performs DNS lookups for all hosts and keeps track of DNS failures
- discovers the homepage if www.example.org/ is a redirect
- keeps track of host statistics such as the number of URLs, 404s, not-modified responses and redirects
- aggregates CrawlDB metadata fields into totals, sums, min, max, average and configurable percentiles
- can output lists of discovered homepage URLs for seed lists and static fetch intervals
- can output blacklists of hosts that have too many DNS failures, so they can be filtered from the CrawlDB using domainblacklist-urlfilter
- just like the CrawlDB, supports JEXL expressions
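To make the aggregation item above concrete, here is a minimal shell sketch of turning a column of numeric values into a sum, min, max, average and one percentile. It is not the HostDB implementation: the `aggregate` function name, the one-value-per-line input and the nearest-rank percentile method are all assumptions made for illustration.

```shell
# Hypothetical helper, not part of Nutch: reads one numeric value per line
# on stdin and prints sum, min, max, avg and a nearest-rank percentile.
aggregate() {
  pct="$1"   # percentile to report, e.g. 90
  sort -n | awk -v p="$pct" '
    { v[NR] = $1; sum += $1 }          # values arrive sorted ascending
    END {
      if (NR == 0) exit 1              # nothing to aggregate
      # nearest-rank: smallest rank r with r >= p/100 * N
      rank = int(p / 100 * NR)
      if (rank < p / 100 * NR) rank++
      if (rank < 1) rank = 1
      printf "sum=%s min=%s max=%s avg=%.1f p%s=%s\n", sum, v[1], v[NR], sum / NR, p, v[rank]
    }'
}
```

For example, `printf '%s\n' 40 10 30 20 50 | aggregate 90` feeds five made-up values through the helper and reports their aggregates on one line.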
Examples
Generate a HostDB for the first time, or update an existing one:
bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb
Optional filtering or normalizing:
bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb -filter -normalize
Dumping as CSV file:
bin/nutch readhostdb crawl/hostdb output_directory
Get only hostnames that have an average response time above 50ms:
bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(avg._rs_ > 50)"
Get only hosts for which more than 50% of records are 404s:
bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(gone / numRecords > 0.5)"
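The same command shape can be reused with other expressions. The sketch below only prints the command line so it can be inspected before running it. The `readhostdb_cmd` helper and the output directory names are hypothetical; the flags and the field names (`dnsFailures`, `redirs`, `numRecords`) are the ones used in this description.

```shell
# Hypothetical helper: prints a readhostdb command for a given HostDB path,
# output directory and JEXL expression, without executing it.
readhostdb_cmd() {
  printf 'bin/nutch readhostdb %s %s -dumpHostnames -expr "(%s)"\n' "$1" "$2" "$3"
}

# Hosts with at least one DNS failure (output directory name is made up).
readhostdb_cmd crawl/hostdb out_dns_failures 'dnsFailures > 0'

# Hosts where more than half of the records are redirects.
readhostdb_cmd crawl/hostdb out_redirect_heavy 'redirs / numRecords > 0.5'
```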
In JEXL expressions, all host metadata fields are available. The built-in host statistics are also available under the following names:
unfetched – number of unfetched records
fetched – number of fetched records
gone – number of 404s
redirTemp – number of temporary redirects
redirPerm – number of permanent redirects
redirs – total number of redirects (redirTemp + redirPerm)
notModified – number of not modified records
ok – number of usable pages (fetched + notModified)
numRecords – total number of records
dnsFailures – number of DNS failures
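The relationships between these fields can be checked with a little shell arithmetic. The counter values below are invented for illustration, and `gone_ratio_exceeds` is a hypothetical helper that mirrors the `gone / numRecords > 0.5` expression under the assumption of floating-point division; it is not how Nutch evaluates JEXL.

```shell
# Made-up counters for a single host.
fetched=120
notModified=30
redirTemp=4
redirPerm=6
gone=50
numRecords=200

# Derived fields, following the definitions in the list above.
redirs=$((redirTemp + redirPerm))   # total redirects
ok=$((fetched + notModified))       # usable pages

# Succeeds (exit status 0) when gone / numRecords is above the threshold.
# Arguments: gone, numRecords, threshold. Guards against division by zero.
gone_ratio_exceeds() {
  awk -v g="$1" -v n="$2" -v t="$3" 'BEGIN { exit !(n > 0 && g / n > t) }'
}

if gone_ratio_exceeds "$gone" "$numRecords" 0.5; then
  echo "more than half of the records are 404s"
else
  echo "404 ratio is at or below the threshold"
fi
```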
See also nutch-default.xml for the hostdb.* properties.
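One way to list those property names is to pull them out of the configuration file with sed. This is a sketch: the `list_hostdb_props` helper is hypothetical, and both the `conf/nutch-default.xml` path and the assumption that each `<name>` element sits on its own line may not match a given installation.

```shell
# Hypothetical helper: prints every property name starting with "hostdb."
# found in the given Hadoop-style XML configuration file.
list_hostdb_props() {
  sed -n 's:.*<name>\(hostdb\.[^<]*\)</name>.*:\1:p' "$1"
}

# Assumed location of the defaults file relative to the Nutch install directory.
if [ -f conf/nutch-default.xml ]; then
  list_hostdb_props conf/nutch-default.xml
fi
```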
Issue Links
- supersedes NUTCH-1149 DomainStats should process numeric CrawlDB metadata (Closed)