Nutch / NUTCH-2666

Increase default value for http.content.limit / ftp.content.limit / file.content.limit


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version: 1.15
    • Fix Version: 1.16
    • Component: fetcher
    • Labels: None

    Description

      The default value for http.content.limit in nutch-default.xml ("The length limit for downloaded content using the http:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting.") is 64 kB. This default should be increased, as many pages today are larger than 64 kB.

      This hit me when crawling a single website whose pages are much larger than 64 kB: with every crawl cycle the count of db_unfetched URLs decreased until it reached zero and the crawler became inactive, because the first 64 kB of every page always contained the same set of navigation links.

      The property description should also be updated, as the limit applies not only to the http protocol but also to https.
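      As a sketch of the workaround for affected crawls (the property name and file are from the source; the 1 MB value is an illustrative choice, not necessarily the new default adopted in 1.16), the limit can be raised per deployment in nutch-site.xml, which overrides nutch-default.xml:

      ```xml
      <!-- nutch-site.xml: local overrides for nutch-default.xml -->
      <property>
        <name>http.content.limit</name>
        <!-- illustrative value: 1 MB; per the property description,
             a negative value disables truncation entirely -->
        <value>1048576</value>
      </property>
      ```

      The same pattern applies to ftp.content.limit and file.content.limit for the other protocols named in the issue title.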


            People

              Assignee: snagel Sebastian Nagel
              Reporter: mebbinghaus Marco Ebbinghaus
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: