Nutch

Delegate parsing of robots.txt to crawler-commons

    Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7, 2.2
    • Component/s: None
    • Labels:

      Description

      We're about to release the first version of Crawler-Commons (http://code.google.com/p/crawler-commons/), which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly.

      1. CC.robots.multiple.agents.patch
        3 kB
        Tejas Patil
      2. CC.robots.multiple.agents.v2.patch
        5 kB
        Tejas Patil
      3. NUTCH-1031.v1.patch
        30 kB
        Tejas Patil
      4. NUTCH-1031-2.x.v1.patch
        62 kB
        Tejas Patil
      5. NUTCH-1031-trunk.v2.patch
        47 kB
        Tejas Patil
      6. NUTCH-1031-trunk.v3.patch
        55 kB
        Tejas Patil
      7. NUTCH-1031-trunk.v4.patch
        55 kB
        Tejas Patil
      8. NUTCH-1031-trunk.v5.patch
        55 kB
        Tejas Patil

        Issue Links

          Activity

          Lewis John McGibbney added a comment -

          Hi Tejas,
          A quick note on keeping pom.xml up-to-date:
          whenever we do a release, pom.xml is brought fully up-to-date based upon the contents and configuration of ivy.xml.
          This means that every tagged branch of Nutch has a completely accurate pom.xml and that the current development branches do not.
          I will make sure to update the pom.xml in the forthcoming releases.
          Regardless, thank you for the attention to detail here.

          Tejas Patil added a comment - edited

          I had forgotten to add the crawler-commons dependency to pom.xml.
          Just committed that to trunk (rev 1480551) and 2.x (rev 1480550).

          Hudson added a comment -

          Integrated in Nutch-nutchgora #587 (See https://builds.apache.org/job/Nutch-nutchgora/587/)
          NUTCH-1031 Delegate parsing of robots.txt to crawler-commons (Revision 1477319)

          Result = FAILURE
          tejasp : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1477319
          Files :

          • /nutch/branches/2.x/CHANGES.txt
          • /nutch/branches/2.x/ivy/ivy.xml
          • /nutch/branches/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/protocol/EmptyRobotRules.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/protocol/Protocol.java
          • /nutch/branches/2.x/src/java/org/apache/nutch/protocol/RobotRulesParser.java
          • /nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
          • /nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
          • /nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java
          • /nutch/branches/2.x/src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
          • /nutch/branches/2.x/src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java
          • /nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java
          • /nutch/branches/2.x/src/plugin/protocol-sftp/src/java/org/apache/nutch/protocol/sftp/Sftp.java
          Tejas Patil added a comment -

          Thanks Lewis
          Changes committed to 2.x (revision 1477319)

          Lewis John McGibbney added a comment -

          +1 from me Tejas. Unit tests all pass fine, and some tests I ran locally were good as well. CLI looks good. Documentation in the patch is really nice.

          Tejas Patil added a comment -

          Patch for 2.x. If there are no objections, I will commit it in the coming days.

          Lewis John McGibbney added a comment -

          Nice work Tejas

          Hudson added a comment -

          Integrated in Nutch-trunk #2156 (See https://builds.apache.org/job/Nutch-trunk/2156/)
          NUTCH-1031 Delegate parsing of robots.txt to crawler-commons (Revision 1465159)

          Result = SUCCESS
          tejasp : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1465159
          Files :

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/ivy/ivy.xml
          • /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
          • /nutch/trunk/src/java/org/apache/nutch/protocol/EmptyRobotRules.java
          • /nutch/trunk/src/java/org/apache/nutch/protocol/Protocol.java
          • /nutch/trunk/src/java/org/apache/nutch/protocol/RobotRulesParser.java
          • /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
          • /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
          • /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java
          • /nutch/trunk/src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
          • /nutch/trunk/src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java
          • /nutch/trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java
          Tejas Patil added a comment -

          I have removed the @author tag and ported the checks from 2.x to the patch, as per the suggestion from Sebastian Nagel. Will commit the changes shortly to trunk and start work on porting these changes to 2.x.

          Sebastian Nagel added a comment -

          There are differences between trunk and 2.x:

          • in org.apache.nutch.protocol.http.api.RobotRulesParser (lib-http) 2.x does additional plausibility checks for properties http.agent.name and http.robots.agents

          Maybe that's worth taking into trunk as well, also with respect to porting this issue to 2.x.

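          A minimal sketch of what such plausibility checks could look like. This is an assumption for illustration only, not the committed 2.x code, and the class name AgentCheckSketch is made up:

          public class AgentCheckSketch {
            // Reject a missing agent name and warn when http.agent.name is not
            // among the comma-separated names in http.robots.agents.
            static void check(String agentName, String robotsAgents) {
              if (agentName == null || agentName.trim().isEmpty()) {
                throw new IllegalStateException("http.agent.name is not set");
              }
              boolean listed = false;
              for (String name : robotsAgents.split(",")) {
                if (name.trim().equalsIgnoreCase(agentName.trim())) {
                  listed = true;
                }
              }
              if (!listed) {
                System.err.println("WARN: http.agent.name '" + agentName
                    + "' is not listed in http.robots.agents '" + robotsAgents + "'");
              }
            }

            public static void main(String[] args) {
              check("MyBot", "OtherBot,*"); // prints the warning
            }
          }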
          Sebastian Nagel added a comment -

          +1 (nothing to complain about)

          P.S.: see Julien Nioche's comment in NUTCH-1541 about the @author tag (a formality you couldn't know about)

          Tejas Patil added a comment -

          Thanks Lewis, I have corrected the usage message.

          Lewis John McGibbney added a comment - edited

          Hi Tejas. Sorry for taking forever to get around to this.

          • I really like the documentation within the patch. Big +1 for this.
          • Tests all pass flawlessly.
          • I like the retention of the main() method in o.a.n.p.RobotRulesParser.

          I've tested this on several websites, including many directories within sites like bbc.co.uk (check out their robots.txt).
          I am +1 for this Tejas. Good work on this one; it's been a long time in coming to Nutch.
          I am keen to hear from others.

          I have one trivial grudge: there is a typo in the usage message for the main method in RobotRulesParser. It should be

          Usage: RobotRulesParser <robots-file> <url-file> <agent-names>
          

          instead of

          Usage: RobotRulesParser <robots-file> <robots-file> <agent-names>
          
          Tejas Patil added a comment -

          @Dev: Can anyone kindly review the patch?

          Tejas Patil added a comment -

          Hey Lewis, thanks for pointing that out. I have updated the patch.

          Lewis John McGibbney added a comment -

          Hi Tejas. If you search Maven Central you will see the 0.2 release of crawler-commons. You will be able to pull this with Ivy, no bother. @Tejas, I agree with your views on keeping CC in the core ivy.xml, as it is likely that we will use it for the sitemaps at some stage as well. Great work Tejas.

          Tejas Patil added a comment -

          Hi Sebastian Nagel, I have made the suggested changes.

          @lufeng: #1 done. As the newer version of CC isn't released publicly (I cannot see it on the project page or in Maven), I am avoiding #2 for now. For #3, I am not keen on creating a robots plugin, as the robots check is mandatory for every crawler. Hence I have kept the RobotRulesParser class in core. However, the protocol-specific robots implementations (currently HttpRobotRulesParser is added in this patch) live inside the respective protocol plugins.

          lufeng added a comment -

          Hi Tejas

          1. The EmptyRobotRules class is not deleted in the NUTCH-1031-trunk.v2.patch file.
          2. Should we add the CC dependency to the ivy.xml configuration?
          3. Can we create RobotRulesParser as a Nutch plugin and extract the Protocol#getRobotRules method? That way we can move the CC dependency from nutch-core to nutch-plugin.

          Thanks

          Tejas Patil added a comment -

          Hi Sebastian,
          Thanks for your time and for suggesting the changes.
          Regarding the JUnit tests: I would remove those from Nutch, as CC already has its own tests and there is no point in testing it again in Nutch.

          Sebastian Nagel added a comment -

          Hi Tejas, a test of NUTCH-1031-trunk.v2.patch in combination with crawler-commons 0.2 shows:

          • protocol.RobotRulesParser.main does not work properly:
            • robotName is not filled properly from the <agent-name>+ arguments
            • parsed rules are printed in the Object string representation (e.g., SimpleRobotRules@2c2f1921)
          • testRobotsTwoAgents failed. However, the tests are quite complex: shouldn't we rely on the exhaustive tests in crawler-commons? A simple test may be sufficient to cover the basic functionality and, e.g., agent names separated by commas.
          Tejas Patil added a comment -

          @Dev: I am planning to commit this change in the coming days. If anyone has suggestions, please feel free to share your thoughts.

          Tejas Patil added a comment -

          Hi Lewis,

          I should have checked the main page of CC before asking over JIRA. Anyway, thanks for the news.

          Regarding "delegating the functionality": I had already made that change for both 1.x and 2.x last month and was waiting for the release of CC. If possible, can you review the patches?

          Lewis John McGibbney added a comment -

          Hi Tejas. We released it.

          Really sorry for not updating.


          Lewis

          Tejas Patil added a comment -

          Hey Ken, a gentle reminder about releasing CC.

          Ken Krugler added a comment -

          I've rolled this into trunk at crawler-commons. The next step is to roll a release. Not sure when I'll get to that, but it's on my list for this week.

          Ken Krugler added a comment -

          Hi Tejas,

          I've been on the road, but I'll check out your patch when I return to my office tomorrow. Thanks for updating it with a test case!

          – Ken

          Tejas Patil added a comment - edited

          Added a patch for Nutch trunk (NUTCH-1031-trunk.v2.patch). If nobody objects, I will work on the corresponding patch for 2.x.
          Summary of the changes done (see the sketch after this list):

          • Removed the RobotRules class, as CC provides a replacement: BaseRobotRules.
          • Moved RobotRulesParser out of the http plugin on account of NUTCH-1513; other protocols might share it.
          • Added HttpRobotRulesParser, which is responsible for fetching the robots file for the http protocol.
          • Changed references from old Nutch classes to classes from CC.
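          A hedged sketch of what the delegation looks like from the caller's side, assuming the crawler-commons 0.2 API (SimpleRobotRulesParser.parseContent, BaseRobotRules.isAllowed and getCrawlDelay); the URL and rules here are made up:

          import crawlercommons.robots.BaseRobotRules;
          import crawlercommons.robots.SimpleRobotRulesParser;

          public class DelegationSketch {
            public static void main(String[] args) {
              byte[] content = "User-agent: *\nDisallow: /private\nCrawl-delay: 5\n".getBytes();
              // Code that used to walk Nutch's own RobotRules asks a
              // crawler-commons BaseRobotRules instead.
              BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                  "http://example.com/robots.txt", content, "text/plain", "mybot");
              System.out.println(rules.isAllowed("http://example.com/private")); // false
              System.out.println(rules.getCrawlDelay()); // crawl delay (assumed to be reported in milliseconds)
            }
          }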
          Tejas Patil added a comment -

          Hi Ken, I have added a test case to CC for the change. (CC.robots.multiple.agents.v2.patch)

          Ken Krugler added a comment -

          Regarding precedence - my guess is that it's not very important, as I haven't seen many (any?) robots.txt files where it would match the same robot, using related names, in rule blocks with different rules.

          This issue of precedence is specific to Nutch users, however (it is not part of the robots.txt RFC), so I'd suggest posting to the Nutch users list to see if anyone thinks it's important.

          As for your review of the CC code: yes, it's correct. There's one additional wrinkle in that the target user agent name is split on spaces, due to what appears to be an implicit expectation that you can use a user agent name with spaces (which, based on the RFC, isn't actually valid) and any piece of the name will match.

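          To make that wrinkle concrete, a standalone sketch of the assumed matching behavior (simplified; this is not the actual crawler-commons implementation):

          public class SpaceSplitSketch {
            // The caller's agent name is split on spaces; any piece matching a
            // User-Agent token in robots.txt counts as a match.
            static boolean matchesAnyPiece(String targetAgentName, String robotsToken) {
              for (String piece : targetAgentName.toLowerCase().split(" ")) {
                if (piece.equals(robotsToken.toLowerCase())) {
                  return true;
                }
              }
              return false;
            }

            public static void main(String[] args) {
              // "my crawler" has a space (not RFC-valid), yet its "crawler" piece matches.
              System.out.println(matchesAnyPiece("my crawler", "crawler")); // true
              System.out.println(matchesAnyPiece("my crawler", "spider"));  // false
            }
          }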
          Tejas Patil added a comment -

          Hi Ken,
          Thanks for reviewing the patch. I will include a test case in the patch. Before that, a bigger question is whether Nutch should adopt the parsing model in CC and forget about the precedence.
          BTW: did you find any error in my understanding of how CC parses robots?

          Ken Krugler added a comment -

          Hi Tejas - I've looked at your patch, and (assuming there's no requirement to support precedence in the user agent name list) it seems like a valid change. Based on the RFC (http://www.robotstxt.org/norobots-rfc.txt), robot names shouldn't contain commas, so splitting on them seems safe. Do you have a unit test to verify proper behavior? If so, I'd be happy to roll that into CC.

          – Ken

          Tejas Patil added a comment -

          I looked at the source code of CC to understand how it works. I have identified the change to be made to CC so that it supports multiple user agents. While testing it, I found that there is a semantic difference between the way CC works and the legacy Nutch parser.

          What CC does:
          It splits http.robots.agents on commas (the change that I made locally).
          It scans the robots file line by line, each time checking whether the current "User-Agent" from the file matches any of the names from http.robots.agents. If a match is found, it takes all the corresponding rules for that agent and stops further parsing.

          robots file
          User-Agent: Agent1 #foo
          Disallow: /a
          
          User-Agent: Agent2 Agent3
          Disallow: /d
          ------------------------------------
          http.robots.agents: "Agent2,Agent1"
          ------------------------------------
          Path: "/a"

          For the example above, as soon as the first line of the robots file is scanned, a match for "Agent1" is found. It scans all the corresponding rules for that agent and stores only this information:

          User-Agent: Agent1
          Disallow: /a

          Everything else is ignored.

          What the Nutch robots parser does:
          It splits http.robots.agents on commas. It scans ALL the lines of the robots file and evaluates the matches according to the precedence of the user agents.
          For the above example, the rules corresponding to both Agent2 and Agent1 have a match in the robots file, but as Agent2 comes first in http.robots.agents, it is given priority and the rules stored are:

          User-Agent: Agent2
          Disallow: /d

          If we want to leave behind the precedence-based behavior and adopt the model in CC, then I have a small patch for crawler-commons (CC.robots.multiple.agents.patch).

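          A runnable sketch of the first-match behavior described above, assuming the patched crawler-commons where parseContent accepts a comma-separated agent string (the robots file and agent names are taken from the example above):

          import crawlercommons.robots.BaseRobotRules;
          import crawlercommons.robots.SimpleRobotRulesParser;

          public class MultiAgentSketch {
            public static void main(String[] args) {
              String robotsTxt =
                  "User-Agent: Agent1 #foo\n" +
                  "Disallow: /a\n" +
                  "\n" +
                  "User-Agent: Agent2 Agent3\n" +
                  "Disallow: /d\n";
              BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                  "http://example.com/robots.txt",
                  robotsTxt.getBytes(),
                  "text/plain",
                  "Agent2,Agent1"); // comma-separated, as in http.robots.agents
              // Agent1's block matches first in the file, so only its rules are kept:
              System.out.println(rules.isAllowed("http://example.com/a")); // false
              System.out.println(rules.isAllowed("http://example.com/d")); // true
            }
          }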
          Julien Nioche added a comment -

          1. Continue to have the legacy code for parsing robots file.

          2. As an add-in, crawler-commons can be employed for the parsing. Users can pick based on a config parameter, with a note indicating that #2 won't work with multiple HTTP agents.

          #2 is overkill IMHO. The existing code works fine, and the point of moving to CC was to get rid of some of our code, not make it bigger with yet another configuration option.

          Lewis: donating our code is a good idea, but in the case of the robots parsing it's more about modifying the existing parser in CC. I haven't had time to look at robots parsing in CC and am not familiar with it, but it would be a good thing to improve it. In the meantime let's go for option 1. Thanks!

          Lewis John McGibbney added a comment -

          Is the issue with multiple agents the only downside to using CC just now?
          I think your proposal is great, Tejas. However, if we are looking into supporting CC for more than just robots.txt parsing, then maybe we ought to look into donating this aspect of the Nutch code?
          Wdyt?

          Tejas Patil added a comment -

          After waiting for more than a week, I think there is little chance of getting a fix / change from crawler-commons.
          I propose the following:
          1. Continue to have the legacy code for parsing robots files.
          2. As an add-in, crawler-commons can be employed for the parsing.

          Users can pick based on a config parameter, with a note indicating that #2 won't work with multiple HTTP agents.
          Would this be fine?

          Tejas Patil added a comment -

          The current Nutch robots parsing logic uses the latter approach. Having a new API for passing a list of robot names would be a clean solution.

          Ken Krugler added a comment -

          Based on my reading of the robots.txt RFC ("The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring."), it seems the User-Agent name (what's in the robots.txt file) is searched for a substring that matches the robot name token (what the caller is using).

          So that means in CC we'd either need to assume that a robot name never contains a comma (and we split the caller-provided name), or we add a new API where you pass in a list of robot names. Thoughts?

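          To make the two options concrete, a hypothetical sketch; this interface and the List-based overload do not exist in crawler-commons and are illustration only:

          import java.util.List;
          import crawlercommons.robots.BaseRobotRules;

          // Hypothetical interface sketching the two options above;
          // crawler-commons' real parser is a class, and the second overload is invented.
          public interface RobotRulesParserApi {
            // Option 1: a single caller-provided name; the parser splits it on commas,
            // assuming robot names never contain commas.
            BaseRobotRules parseContent(String url, byte[] content,
                String contentType, String robotNames);

            // Option 2: the caller passes the names explicitly, so no assumption
            // about commas in names is needed.
            BaseRobotRules parseContent(String url, byte[] content,
                String contentType, List<String> robotNames);
          }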
          Markus Jelsma added a comment -

          I think it would be a very good thing to maintain support for multiple user agents, as it gives crawler operators the flexibility to be lenient about how webmasters spell the crawler name in their robots.txt.

          Julien Nioche added a comment -

          Well, we have two separate params: http.agent.name, which is a single value sent to the servers when fetching, and http.robots.agents, which can have multiple values and is used when parsing robots. The value of this parameter SHOULD be split on commas.

          I don't think CC supports multiple values for http.robots.agents, but I'll ask Ken to be sure.

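          A minimal sketch of that split; the two property names are the Nutch configuration keys from this discussion, while the values and class name are made up:

          import java.util.Arrays;
          import java.util.List;

          public class AgentParamsSketch {
            public static void main(String[] args) {
              String httpAgentName = "MyBot";             // single value, sent in the User-Agent header
              String httpRobotsAgents = "MyBot,my-bot,*"; // comma-separated, used for robots.txt matching
              List<String> agents = Arrays.asList(httpRobotsAgents.split(","));
              System.out.println(agents);                          // [MyBot, my-bot, *]
              System.out.println(agents.contains(httpAgentName));  // true
            }
          }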
          Tejas Patil added a comment -

          The changes are done. Please let me know your comments.

          One issue: I am not sure how crawler-commons handles multiple agents. There is one test case (testRobotsTwoAgents) failing because of that, and I am not able to fix it. Can anyone help?

          Julien Nioche added a comment -

          crawler-commons is not super active, and I have been pretty much the only person actively involved. There have been bugfixes since the release, but not necessarily committed, IIRC.
          The robots parsing is working OK in Nutch, and we have loads of other things to work on which are probably more important.

          Lewis John McGibbney added a comment -

          crawler-commons is available within Maven Central. Are we still interested in delegating our parsing code to crawler-commons? What is the community like over at crawler-commons, e.g. if we find bugs in the code, how and when will/could they get fixed?

          Markus Jelsma added a comment -

          20120304-push-1.6

          Lewis John McGibbney added a comment -

          Hi Julien, out of sheer curiosity, how do we currently parse robots.txt? I found some files (which don't do parsing) in o.a.n.protocol, but I've never known what we use for robots.txt.


            People

            • Assignee:
              Tejas Patil
              Reporter:
              Julien Nioche
            • Votes:
              0
              Watchers:
              9

              Dates

              • Created:
                Updated:
                Resolved:

                Development