Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 7.2
- Fix Version/s: None
- Component/s: None
Description
[First issue here, apologies in advance for missteps.]
Three things which could improve working with robots.txt:
- When fetching the robots.txt that corresponds to a URL, the port is ignored, so the request defaults to :80. If nothing is listening on :80, the robots.txt lookup fails and the page is fetched anyway. isDisallowedByRobots() could include url.getPort() when constructing strRobot. This helps when testing your robots.txt on a non-standard port, for example during development (see the first sketch below).
- Disallow directives are applied regardless of User-agent. parseRobotsTxt() could let a group that explicitly names SimplePostTool-crawler override the blanket Disallow rules. This would help when indexing your own site after explicitly allowing SimplePostTool to crawl it in robots.txt. I don't know if that's good practice, but it would help in testing (see the second sketch below).
- The User-agent header sent when fetching robots.txt is not "SimplePostTool-crawler"; it shows up as "Java/<version>". The code that sets the header correctly in readPageFromUrl() could be reused in isDisallowedByRobots() (see the third sketch below).
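For the first point, a minimal sketch of what including the port could look like. The helper name is hypothetical and not the tool's actual method; it only shows strRobot carrying the original URL's port when one is present.

{code:java}
import java.net.URL;

public class RobotsUrlSketch {
  // Hypothetical helper: build the robots.txt URL from the page URL, keeping a
  // non-default port (e.g. :8983 during development) instead of silently
  // falling back to :80.
  static String robotsUrlFor(URL pageUrl) {
    int port = pageUrl.getPort();             // -1 when no explicit port is given
    String portPart = (port == -1) ? "" : ":" + port;
    return pageUrl.getProtocol() + "://" + pageUrl.getHost() + portPart + "/robots.txt";
  }

  public static void main(String[] args) throws Exception {
    // Prints http://localhost:8983/robots.txt instead of http://localhost/robots.txt
    System.out.println(robotsUrlFor(new URL("http://localhost:8983/docs/index.html")));
  }
}
{code}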
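For the second point, a sketch of a user-agent-aware parse. The method name and signature are only illustrative, and the grouping logic is simplified (consecutive User-agent lines forming one group are not handled); the idea is that a group explicitly addressing SimplePostTool-crawler takes precedence over the "*" group.

{code:java}
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RobotsParseSketch {
  // Illustrative parse: collect Disallow paths per User-agent group and let an
  // explicit SimplePostTool-crawler group override the wildcard group.
  static List<String> disallowsFor(InputStream robotsTxt, String crawlerName) throws Exception {
    List<String> wildcard = new ArrayList<>();
    List<String> forCrawler = new ArrayList<>();
    boolean inWildcard = false, inCrawler = false, sawCrawlerGroup = false;
    try (BufferedReader r = new BufferedReader(
        new InputStreamReader(robotsTxt, StandardCharsets.UTF_8))) {
      String line;
      while ((line = r.readLine()) != null) {
        String l = line.trim();
        String lower = l.toLowerCase(Locale.ROOT);
        if (lower.startsWith("user-agent:")) {
          String agent = l.substring("user-agent:".length()).trim();
          inWildcard = agent.equals("*");
          inCrawler = agent.equalsIgnoreCase(crawlerName);
          if (inCrawler) sawCrawlerGroup = true;
        } else if (lower.startsWith("disallow:")) {
          String path = l.substring("disallow:".length()).trim();
          if (inCrawler) forCrawler.add(path);
          else if (inWildcard) wildcard.add(path);
        }
      }
    }
    // Rules addressed to the crawler by name win over the blanket rules.
    return sawCrawlerGroup ? forCrawler : wildcard;
  }
}
{code}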
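For the third point, a sketch of setting the header on the robots.txt request. The user-agent string is passed in as a parameter here; readPageFromUrl() already sets it for regular pages, and the exact constant it uses in SimplePostTool may be named differently.

{code:java}
import java.net.HttpURLConnection;
import java.net.URL;

public class RobotsFetchSketch {
  // Open the robots.txt connection with the crawler's User-agent set, mirroring
  // what readPageFromUrl() does for regular pages. Without this, the JVM sends
  // its default "Java/<version>" user agent.
  static HttpURLConnection openRobotsConnection(URL robotsUrl, String userAgent) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) robotsUrl.openConnection();
    conn.setRequestProperty("User-Agent", userAgent);
    return conn;
  }
}
{code}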