Nutch / NUTCH-101

RobotRulesParser

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6, 0.7, 0.7.1, 0.8
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels: None

    Description

      I noticed this code in the protocol-http and protocol-httpclient plugins:

      } else if ( (line.length() >= 6)
      && (line.substring(0, 6).equalsIgnoreCase("Allow:")) ) {

      However, the original 1994 protocol description defines NO "Allow:" field. To allow everything, simply use "Disallow:" with an empty value. See http://www.robotstxt.org/wc/norobots.html

      Please try testing with www.newegg.com/robots.txt. Their site has this:

        User-agent: *
        Disallow:

      Nutch does not work with Newegg, but it should!
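
      The empty "Disallow:" case can be illustrated with a minimal sketch. This is not the Nutch parser; the class and method names (DisallowSketch, disallowedPrefixes, isAllowed) are hypothetical, and the sketch assumes the file holds a single record that applies to every agent:

```java
import java.util.ArrayList;
import java.util.List;

public class DisallowSketch {
    /** Collects path prefixes from "Disallow:" lines; an empty value adds no restriction. */
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        for (String line : robotsTxt.split("\r?\n")) {
            // Strip trailing comments, then surrounding whitespace.
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            // Case-insensitive match on the field name; short lines simply do not match.
            if (line.regionMatches(true, 0, "Disallow:", 0, 9)) {
                String path = line.substring(9).trim();
                if (!path.isEmpty()) {
                    prefixes.add(path);
                }
            }
        }
        return prefixes;
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedPrefixes(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

      With the two-line file quoted above, disallowedPrefixes returns an empty list, so every path is allowed.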

      Sorry guys, I don't have enough time to double-check this myself. Could you please verify all of it?

      I noticed a strange discussion on nutch-agent:lucene.apache.org. It seems that we need to test ......./robots.txt:

      User-agent: ia_archiver
      Disallow: /

      User-agent: Googlebot-Image
      Disallow: /

      User-agent: Nutch
      Disallow: /

      User-agent: TurnitinBot
      Disallow: /

      Everything here follows the standard protocol. Could you please retest whether the parser works with multiple records like these, one User-agent line per record? It's a standard!
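
      A multi-record file like the one above could be handled roughly as follows. This is only a sketch under stated assumptions, not the Nutch implementation: RecordSketch, parseRecords and isAllowed are hypothetical names, comments are not stripped, and agent names are compared by exact case-insensitive equality rather than by the substring match the protocol recommends:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class RecordSketch {
    /** Maps each lower-cased User-agent value to the Disallow prefixes of its record. */
    static Map<String, List<String>> parseRecords(String robotsTxt) {
        Map<String, List<String>> rules = new HashMap<>();
        List<String> currentAgents = new ArrayList<>();
        boolean sawRule = false;
        for (String raw : robotsTxt.split("\r?\n")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // A User-agent line that follows a rule line starts a new record.
                if (sawRule) {
                    currentAgents = new ArrayList<>();
                    sawRule = false;
                }
                String agent = value.toLowerCase(Locale.ROOT);
                currentAgents.add(agent);
                rules.putIfAbsent(agent, new ArrayList<>());
            } else if (field.equals("disallow")) {
                sawRule = true;
                // An empty value restricts nothing, so only non-empty prefixes are stored.
                if (!value.isEmpty()) {
                    for (String agent : currentAgents) {
                        rules.get(agent).add(value);
                    }
                }
            }
        }
        return rules;
    }

    /** Uses the agent's own record if present, otherwise the "*" record, otherwise allows. */
    static boolean isAllowed(Map<String, List<String>> rules, String agent, String path) {
        List<String> prefixes = rules.get(agent.toLowerCase(Locale.ROOT));
        if (prefixes == null) {
            prefixes = rules.getOrDefault("*", new ArrayList<>());
        }
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

      Under these assumptions, the four-record file above blocks "Nutch" from every path, while an agent that is not listed stays unrestricted because the file has no "*" record.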

      I see this in code:
      StringTokenizer tok = new StringTokenizer(agentNames, ",");

      Comma-separated agent names? That is not part of the accepted standard yet...

      Sorry WebExpertsAmerica, I really didn't have time to run any tests...

      Please do not execute tests against production sites.
      Thanks!

          People

            Assignee: Unassigned
            Reporter: Fuad Efendi (funtick)