Nutch / NUTCH-2808

Document side effects of ignoring robots.txt

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 1.19
    • Component/s: documentation, robots
    • Labels: None

    Description

      (see NUTCH-1927 and NUTCH-2803)

      The aim of NUTCH-1927 was to make it possible to ignore the robots.txt for a defined set of hosts/domains. Ignoring the robots.txt entirely has some side effects that should be documented:

      • undesired content (duplicates, private pages, etc.) may get indexed
      • the Crawl-Delay is ignored
      • no sitemaps are detected (cf. NUTCH-2807)
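
      The three side effects listed above can be illustrated with a small sketch. It uses Python's standard urllib.robotparser, not Nutch's own robots.txt handling, and the robots.txt content, the agent name "mycrawler", the example.com URLs and both helper functions are assumptions made up for the illustration. It compares what a crawler learns when it parses robots.txt with what is left when the file is never consulted.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example host (not taken from the issue).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""


def robots_info(robots_txt, agent, url):
    """Return what a crawler learns from robots.txt for one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "allowed": parser.can_fetch(agent, url),
        "crawl_delay": parser.crawl_delay(agent),
        "sitemaps": parser.site_maps() or [],
    }


def ignored_info():
    """What remains when robots.txt is never fetched or parsed."""
    return {"allowed": True, "crawl_delay": None, "sitemaps": []}


url = "https://example.com/private/page.html"
honored = robots_info(ROBOTS_TXT, "mycrawler", url)
ignored = ignored_info()
print("honoring robots.txt:", honored)
print("ignoring robots.txt:", ignored)
```

      In the ignored case the private page is treated as fetchable, no crawl delay is known, and the sitemap URL is never seen, which corresponds to the three points in the list.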

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Sebastian Nagel (snagel)
              Votes: 0
              Watchers: 4