  Nutch / NUTCH-753

Prevent new Fetcher from retrieving robots.txt twice


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.1
    • Component/s: fetcher
    • Labels: None

    Description

      The new Fetcher, which is now used by default, handles the robots.txt file directly instead of relying on the protocol implementation. The flags Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS are set to false to prevent fetching robots.txt twice (once in the Fetcher and once in the protocol), which skips the call to robots.isAllowed(). In practice, however, the robots file is still fetched: a call to robots.getCrawlDelay() a bit further down is not covered by the if (Protocol.CHECK_ROBOTS) check.
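      A minimal sketch of the flaw described above, with hypothetical names (RobotRules, buggyFlow, fixedFlow are illustrative stand-ins, not actual Nutch source). A fetch counter on a caching robots-rules stub shows that the unguarded getCrawlDelay() call still triggers a robots.txt retrieval even when the check flag is off, while moving it under the same guard does not:

```java
// Illustrative sketch only: a caching robots-rules stub that counts how
// many times robots.txt is actually retrieved.
public class RobotsFetchSketch {

    // Hypothetical stand-in for the robots rules object.
    static class RobotRules {
        int fetchCount = 0;       // number of real robots.txt retrievals
        private boolean fetched = false;

        // Both accessors lazily fetch robots.txt on first use.
        private void fetchIfNeeded() {
            if (!fetched) {
                fetched = true;
                fetchCount++;
            }
        }

        boolean isAllowed(String url) { fetchIfNeeded(); return true; }
        long getCrawlDelay()          { fetchIfNeeded(); return 0L; }
    }

    // Flow matching the reported bug: isAllowed() is guarded by the
    // check flag, but getCrawlDelay() is called unconditionally, so
    // robots.txt is fetched even when checkRobots is false.
    static int buggyFlow(boolean checkRobots) {
        RobotRules robots = new RobotRules();
        if (checkRobots) {
            robots.isAllowed("http://example.com/page");
        }
        robots.getCrawlDelay();   // unguarded: forces the fetch anyway
        return robots.fetchCount;
    }

    // Flow matching the fix: the crawl-delay lookup sits under the
    // same guard, so no robots.txt fetch happens when the flag is off.
    static int fixedFlow(boolean checkRobots) {
        RobotRules robots = new RobotRules();
        if (checkRobots) {
            robots.isAllowed("http://example.com/page");
            robots.getCrawlDelay();
        }
        return robots.fetchCount;
    }

    public static void main(String[] args) {
        System.out.println(buggyFlow(false)); // prints 1: fetched despite the flag
        System.out.println(fixedFlow(false)); // prints 0: no fetch
        System.out.println(fixedFlow(true));  // prints 1: fetched once, cached after
    }
}
```

      With caching, both accessors share one retrieval when the flag is on; the bug is only that the unguarded call defeats the flag when it is off.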

      Attachments

        1. NUTCH-753.patch
          1 kB
          Julien Nioche

        Activity

          People

            Assignee: Andrzej Bialecki (ab)
            Reporter: Julien Nioche (jnioche)
            Votes: 1
            Watchers: 0

            Dates

              Created:
              Updated:
              Resolved: