[NUTCH-2996] Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 1.20
    • Fix Version/s: 1.20
    • Component/s: robots
    • Labels: None

    Description

      The robots.txt parser (SimpleRobotRulesParser) in crawler-commons 1.4 (#1085) introduces a new API entry point for parsing robots.txt content:

      • it is more efficient because it accepts a collection of lower-cased, single-word user-agent product tokens, so a (comma-separated) list of user-agent strings no longer has to be tokenized again for every robots.txt
      • user-agent matching is compliant with RFC 9309 (section 2.2.1) only if the new API method is used
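
      The first point can be illustrated with a minimal sketch: the configured agent names are split and lower-cased once, and the resulting collection is reused for every robots.txt. The class name `AgentNameTokens` and the helper `toProductTokens` are hypothetical and not taken from Nutch or crawler-commons. The sketch assumes each comma-separated name is already a single-word product token. How the collection is passed to SimpleRobotRulesParser is not shown, since the exact method signature should be checked against the crawler-commons 1.4 Javadoc.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class AgentNameTokens {

    /**
     * Hypothetical helper: turns a comma-separated list of agent names
     * into a set of lower-cased product tokens. Intended to be called once
     * (e.g. at configuration time), not once per robots.txt.
     */
    public static Set<String> toProductTokens(String agentNames) {
        // LinkedHashSet keeps the configured order and drops duplicates
        Set<String> tokens = new LinkedHashSet<>();
        if (agentNames == null) {
            return tokens;
        }
        for (String name : agentNames.split(",")) {
            // Locale.ROOT avoids locale-specific lower-casing surprises
            String token = name.trim().toLowerCase(Locale.ROOT);
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Tokenize once, then reuse the same collection for every robots.txt
        Set<String> tokens = toProductTokens("MyBot, Nutch-Test,mybot");
        System.out.println(tokens); // prints [mybot, nutch-test]
    }
}
```

      Holding the tokens in a set built up front is the design choice that removes the repeated tokenization: the per-robots.txt work no longer includes any string splitting of the agent list.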

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Sebastian Nagel (snagel)