Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 7.2
- Fix Version/s: None
- Component/s: None
Description
[First issue here, apologies in advance for missteps.]
Three things which could improve working with robots.txt:
- When fetching the robots.txt that corresponds to a URL, the port is ignored, so the request defaults to :80. If nothing is listening on :80, the robots.txt lookup fails and the page is fetched anyway. isDisallowedByRobots() could include url.getPort() when constructing strRobot. This helps when testing your robots.txt on a non-standard port, for example during development (see the first sketch below).
- Disallow directives are applied regardless of User-agent. parseRobotsTxt() could let a group that explicitly names SimplePostTool-crawler override the blanket Disallow rules. This would help when indexing your own site after explicitly allowing SimplePostTool to crawl it in robots.txt. I don't know if that's good practice, but it would help in testing (see the second sketch below).
- The User-agent header sent when fetching robots.txt is not "SimplePostTool-crawler"; it shows up as "Java/<version>". The code that sets the header correctly in readPageFromUrl() could be reused in isDisallowedByRobots() (see the third sketch below).
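For the first point, a minimal sketch of what including the port could look like. The helper name is hypothetical and not the tool's actual method; it only shows strRobot carrying the original URL's port when one is present.

{code:java}
import java.net.URL;

public class RobotsUrlSketch {
  // Hypothetical helper: build the robots.txt URL from the page URL, keeping a
  // non-default port (e.g. :8983 during development) instead of silently
  // falling back to :80.
  static String robotsUrlFor(URL pageUrl) {
    int port = pageUrl.getPort();             // -1 when no explicit port is given
    String portPart = (port == -1) ? "" : ":" + port;
    return pageUrl.getProtocol() + "://" + pageUrl.getHost() + portPart + "/robots.txt";
  }

  public static void main(String[] args) throws Exception {
    // Prints http://localhost:8983/robots.txt instead of http://localhost/robots.txt
    System.out.println(robotsUrlFor(new URL("http://localhost:8983/docs/index.html")));
  }
}
{code}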
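For the second point, a sketch of a user-agent-aware parse. The method name and signature are only illustrative, and the grouping logic is simplified (consecutive User-agent lines forming one group are not handled); the idea is that a group explicitly addressing SimplePostTool-crawler takes precedence over the "*" group.

{code:java}
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RobotsParseSketch {
  // Illustrative parse: collect Disallow paths per User-agent group and let an
  // explicit SimplePostTool-crawler group override the wildcard group.
  static List<String> disallowsFor(InputStream robotsTxt, String crawlerName) throws Exception {
    List<String> wildcard = new ArrayList<>();
    List<String> forCrawler = new ArrayList<>();
    boolean inWildcard = false, inCrawler = false, sawCrawlerGroup = false;
    try (BufferedReader r = new BufferedReader(
        new InputStreamReader(robotsTxt, StandardCharsets.UTF_8))) {
      String line;
      while ((line = r.readLine()) != null) {
        String l = line.trim();
        String lower = l.toLowerCase(Locale.ROOT);
        if (lower.startsWith("user-agent:")) {
          String agent = l.substring("user-agent:".length()).trim();
          inWildcard = agent.equals("*");
          inCrawler = agent.equalsIgnoreCase(crawlerName);
          if (inCrawler) sawCrawlerGroup = true;
        } else if (lower.startsWith("disallow:")) {
          String path = l.substring("disallow:".length()).trim();
          if (inCrawler) forCrawler.add(path);
          else if (inWildcard) wildcard.add(path);
        }
      }
    }
    // Rules addressed to the crawler by name win over the blanket rules.
    return sawCrawlerGroup ? forCrawler : wildcard;
  }
}
{code}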
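For the third point, a sketch of setting the header on the robots.txt request. The user-agent string is passed in as a parameter here; readPageFromUrl() already sets it for regular pages, and the exact constant it uses in SimplePostTool may be named differently.

{code:java}
import java.net.HttpURLConnection;
import java.net.URL;

public class RobotsFetchSketch {
  // Open the robots.txt connection with the crawler's User-agent set, mirroring
  // what readPageFromUrl() does for regular pages. Without this, the JVM sends
  // its default "Java/<version>" user agent.
  static HttpURLConnection openRobotsConnection(URL robotsUrl, String userAgent) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) robotsUrl.openConnection();
    conn.setRequestProperty("User-Agent", userAgent);
    return conn;
  }
}
{code}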