Description
The default value for http.content.limit in nutch-default.xml is 64 kB. The property description reads: "The length limit for downloaded content using the http:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting." This default should perhaps be increased, since many pages today are larger than 64 kB.
I ran into this while crawling a single website whose pages are much larger than 64 kB. Because the first 64 kB of every page always contained the same set of navigation links, the count of db_unfetched URLs decreased with every crawl cycle until it reached zero and the crawler became inactive.
The property description might also be updated, since the limit applies not only to the http protocol but also to https.
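As a workaround until any default changes, the limit can be raised per installation by overriding the property in conf/nutch-site.xml. The snippet below is a minimal sketch; the 1 MB value is only an illustrative choice, not a recommended default:

<property>
  <name>http.content.limit</name>
  <!-- Example value: 1 MB (1048576 bytes). Per the property description, a negative value disables truncation entirely. -->
  <value>1048576</value>
</property>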
Issue Links
- is related to: NUTCH-2511 SitemapProcessor limited by http.content.limit (Closed)