Nutch / NUTCH-2754

fetcher.max.crawl.delay ignored if exceeding 5 min. / 300 sec.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.16
    • Fix Version/s: 1.17
    • Component/s: fetcher, robots
    • Labels: None

    Description

      Sites whose robots.txt specifies a Crawl-Delay of more than 5 minutes (301 seconds or more) are always excluded from fetching, even if fetcher.max.crawl.delay is set to a higher value.

      We need to pass the configured fetcher.max.crawl.delay to crawler-commons' robots.txt parser; otherwise the parser falls back to its internal default maximum of 300 sec. and disallows all sites whose robots.txt specifies a longer Crawl-Delay.
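
      The effect can be reproduced with a short, self-contained sketch against crawler-commons' SimpleRobotRulesParser. The robots.txt content, host name, and agent name below are made up for illustration; the parser is used with its default settings, i.e. with the internal 300 sec. maximum in place:

          import java.nio.charset.StandardCharsets;

          import crawlercommons.robots.BaseRobotRules;
          import crawlercommons.robots.SimpleRobotRulesParser;

          public class CrawlDelayDemo {
              public static void main(String[] args) {
                  // hypothetical robots.txt with a Crawl-Delay above the
                  // parser's internal maximum of 300 seconds
                  byte[] robotsTxt = "User-agent: *\nCrawl-Delay: 600\n"
                          .getBytes(StandardCharsets.UTF_8);

                  // default-constructed parser: the 300 sec. limit applies
                  SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
                  BaseRobotRules rules = parser.parseContent(
                          "http://www.example.com/robots.txt", robotsTxt,
                          "text/plain", "nutch");

                  // 600 sec. exceeds the internal limit, so the parser
                  // returns "allow none" rules: every URL of the site is
                  // disallowed, regardless of how fetcher.max.crawl.delay
                  // is configured in Nutch
                  System.out.println(
                          rules.isAllowed("http://www.example.com/")); // false
              }
          }

      The fix is to hand the configured fetcher.max.crawl.delay (seconds, converted to milliseconds) to the parser instead of relying on its built-in default; in recent crawler-commons releases this maximum is configurable when the parser is constructed.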

People

    • Assignee: Unassigned
    • Reporter: Sebastian Nagel (snagel)
