NUTCH-731: Redirection of robots.txt in RobotRulesParser

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.1
    • Component/s: fetcher
    • Labels: None
    • Patch Info: Patch Available

    Description

      The attached patch allows RobotRulesParser to follow one level of redirection when fetching robots.txt files. A similar issue was reported in NUTCH-124 and marked as fixed a long time ago, but the problem remained, at least when using Fetcher2. Mathijs Homminga pointed out the problem in a mail to the nutch-dev list in March.

      I have been using this patch for a while now on a large cluster and have noticed that the ratio of robots_denied per fetchlist went up, meaning that we are now at least honoring restrictions we would previously have missed (and getting fewer complaints from webmasters at the same time).
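
As a rough illustration of what "following one level of redirection" for robots.txt could look like, here is a minimal sketch. It is not the contents of NUTCH-731.patch and does not use Nutch classes: `RobotsRedirectSketch`, `Response` and `fetchRobots` are hypothetical names, and the HTTP layer is assumed to be a plain function from URL to response so that the example runs offline.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class RobotsRedirectSketch {

    /** Hypothetical stand-in for an HTTP response; not a Nutch class. */
    static final class Response {
        final int status;
        final String location; // value of the Location header, or null
        final String body;     // response body, or null

        Response(int status, String location, String body) {
            this.status = status;
            this.location = location;
            this.body = body;
        }
    }

    /**
     * Fetches robots.txt and follows at most ONE redirect.
     * Returns the robots.txt body, or null if it could not be obtained.
     */
    static String fetchRobots(URI robotsUrl, Function<URI, Response> fetcher) {
        Response first = fetcher.apply(robotsUrl);
        if (first == null) {
            return null;
        }
        boolean redirected = first.status >= 300 && first.status < 400
                && first.location != null;
        if (redirected) {
            // The Location header may be relative, so resolve it against the original URL.
            URI target = robotsUrl.resolve(first.location);
            Response second = fetcher.apply(target);
            // Only one level is followed: a second redirect counts as a failure.
            return (second != null && second.status == 200) ? second.body : null;
        }
        return first.status == 200 ? first.body : null;
    }

    public static void main(String[] args) {
        // Assumed example site: the bare domain redirects to the "www" host.
        Map<URI, Response> site = new HashMap<>();
        site.put(URI.create("http://example.com/robots.txt"),
                new Response(301, "http://www.example.com/robots.txt", null));
        site.put(URI.create("http://www.example.com/robots.txt"),
                new Response(200, null, "User-agent: *\nDisallow: /private"));

        Response notFound = new Response(404, null, null);
        String rules = fetchRobots(URI.create("http://example.com/robots.txt"),
                url -> site.getOrDefault(url, notFound));
        System.out.println(rules);
    }
}
```

Without the redirect step, a crawler in this scenario would see only the 301 response and would never read the `Disallow` rule served from the redirect target, which is consistent with the rise in robots_denied described above.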

      Attachments

        1. NUTCH-731.patch
          1.0 kB
          Julien Nioche

            People

              Assignee: Andrzej Bialecki (ab)
              Reporter: Julien Nioche (jnioche)
              Votes: 2
              Watchers: 0