Nutch / NUTCH-753

Prevent new Fetcher to retrieve the robots twice


    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.1
    • Component/s: fetcher
    • Labels:
      None

      Description

      The new Fetcher, which is now used by default, handles the robots.txt file directly instead of relying on the protocol. The options Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS are set to false to prevent fetching robots.txt twice (once in the Fetcher and once in the protocol), which skips the call to robots.isAllowed(). In practice, however, the robots file is still fetched: a call to robots.getCrawlDelay() a bit further down is not covered by the if (Protocol.CHECK_ROBOTS) guard.
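      The defect described above can be illustrated with a small sketch. The RobotRules stub below is hypothetical (real Nutch code goes through the protocol plugins and a robots rules parser); it only models the lazy robots.txt retrieval so the extra fetch can be counted, and shows why the crawl-delay lookup must sit inside the same Protocol.CHECK_ROBOTS guard:

```java
// Hypothetical stub: lazily "fetches" robots.txt the first time any
// rule is consulted, and counts how many retrievals happen.
class RobotRules {
    int fetchCount = 0;
    private boolean fetched = false;

    private void fetchRobotsTxt() {          // simulates the network fetch
        if (!fetched) { fetchCount++; fetched = true; }
    }

    boolean isAllowed(String url) { fetchRobotsTxt(); return true; }
    long getCrawlDelay()          { fetchRobotsTxt(); return 0L; }
}

class Fetcher {
    // The Fetcher handles robots itself, so the protocol-side check is off.
    static final boolean CHECK_ROBOTS = false;

    // Buggy flow: getCrawlDelay() is called unconditionally, so robots.txt
    // is retrieved even though CHECK_ROBOTS is false. Returns fetch count.
    static int buggyFlow(RobotRules rules) {
        if (CHECK_ROBOTS && !rules.isAllowed("http://example.com/")) return -1;
        long delay = rules.getCrawlDelay();  // outside the guard -> extra fetch
        return rules.fetchCount;
    }

    // Fixed flow: the crawl-delay lookup is covered by the same guard,
    // so no robots.txt retrieval happens when CHECK_ROBOTS is false.
    static int fixedFlow(RobotRules rules) {
        long delay = 0L;
        if (CHECK_ROBOTS) {
            if (!rules.isAllowed("http://example.com/")) return -1;
            delay = rules.getCrawlDelay();
        }
        return rules.fetchCount;
    }
}
```

      With CHECK_ROBOTS set to false, the buggy flow still performs one robots.txt fetch while the guarded flow performs none, which is the behavior the attached patch presumably restores.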

        Attachments

        1. NUTCH-753.patch
          1 kB
          Julien Nioche

          Activity

            People

            • Assignee:
              Andrzej Bialecki (ab)
            • Reporter:
              Julien Nioche (jnioche)
            • Votes:
              1
            • Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved: