NUTCH-1042

Fetcher.max.crawl.delay property not taken into account correctly when set to -1

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.7, 2.2
    • Component/s: fetcher
    • Labels: None

    Description

      [Originally reported at http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html]

      From nutch-default.xml:

      "
      <property>
      <name>fetcher.max.crawl.delay</name>
      <value>30</value>
      <description>
      If the Crawl-Delay in robots.txt is set to greater than this value (in
      seconds) then the fetcher will skip this page, generating an error report.
      If set to -1 the fetcher will never skip such pages and will wait the
      amount of time retrieved from robots.txt Crawl-Delay, however long that
      might be.
      </description>
      </property>
      "

      Line 554 in Fetcher.java
      (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup):

      this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;

      Lines 615-616 in Fetcher.java:

      if (rules.getCrawlDelay() > 0) {
        if (rules.getCrawlDelay() > maxCrawlDelay) {

      The documentation states that if fetcher.max.crawl.delay is set to -1,
      the crawler will always wait the amount of time specified by the
      robots.txt Crawl-Delay, however long it is. However, when the property is
      -1, line 554 sets maxCrawlDelay to -1 * 1000 = -1000, so the comparison
      on line 616 is true for every positive Crawl-Delay, and every page whose
      robots.txt sets a Crawl-Delay is skipped instead of waited for.
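
      A minimal sketch of a guard that would restore the documented behaviour
      (an illustration only, not the patch that was committed for 1.7/2.2; the
      class and method names are hypothetical stand-ins for the Fetcher logic
      quoted above):

      public class MaxCrawlDelayCheck {

        // Buggy check (mirrors lines 615-616): any positive Crawl-Delay is greater
        // than a negative maxCrawlDelay, so the page is always skipped.
        static boolean skipPageBuggy(long crawlDelayMs, long maxCrawlDelayMs) {
          return crawlDelayMs > 0 && crawlDelayMs > maxCrawlDelayMs;
        }

        // Possible fix: treat a negative maxCrawlDelay as "no upper limit" and
        // only compare against it when it is non-negative.
        static boolean skipPageFixed(long crawlDelayMs, long maxCrawlDelayMs) {
          return crawlDelayMs > 0 && maxCrawlDelayMs >= 0 && crawlDelayMs > maxCrawlDelayMs;
        }

        public static void main(String[] args) {
          long robotsCrawlDelayMs = 120000; // robots.txt Crawl-Delay: 120 seconds
          long maxCrawlDelayMs = -1000;     // fetcher.max.crawl.delay = -1, as computed above
          System.out.println(skipPageBuggy(robotsCrawlDelayMs, maxCrawlDelayMs)); // true  -> page wrongly skipped
          System.out.println(skipPageFixed(robotsCrawlDelayMs, maxCrawlDelayMs)); // false -> Crawl-Delay honoured
        }
      }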

People

    Assignee: Lewis John McGibbney (lewismc)
    Reporter: Nutch User - 1 (nutch_user_1)
    Votes: 0
    Watchers: 2
