Description
The default value for http.content.limit in nutch-default.xml is 64 kB. The property description reads: "The length limit for downloaded content using the http:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting." This default should perhaps be increased, since many pages today are larger than 64 kB.
I ran into this while crawling a single website whose pages are much larger than 64 kB. Because the first 64 kB of every page always contained the same set of navigation links, the count of db_unfetched URLs decreased with every crawl cycle until it reached zero and the crawler became inactive.
The property description might also be updated, since the limit applies not only to the http protocol but also to https.
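As a workaround until any default changes, the limit can be raised per installation by overriding the property in conf/nutch-site.xml. The snippet below is a minimal sketch; the 1 MB value is only an illustrative choice, not a recommended default:

<property>
  <name>http.content.limit</name>
  <!-- Example value: 1 MB (1048576 bytes). Per the property description, a negative value disables truncation entirely. -->
  <value>1048576</value>
</property>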
Issue Links
- is related to: NUTCH-2511 SitemapProcessor limited by http.content.limit (Closed)