Description
As discussed on the mailing list, index-metadata fails to ignore a webpage with a capitalized robots metatag such as <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">. This only happens when parse-tika is used; parse-html lowercases ("decapitalizes") the metatag name and content.
Parsing the attached noindex.html leads to the following results:
parse-html:
bin/nutch parsechecker -Dplugin.includes="protocol-httpclient|parse-(html|metatags)|index-metadata" -Dindexer.delete.robots.noindex="true" -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" http://localhost:8080/noindex.html
Parse Metadata: [...] metatag.robots=noindex,nofollow robots=noindex,nofollow
parse-tika:
bin/nutch parsechecker -Dplugin.includes="protocol-httpclient|parse-(tika|metatags)|index-metadata" -Dindexer.delete.robots.noindex="true" -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" http://localhost:8080/noindex.html
Parse Metadata: metatag.robots=NOINDEX,NOFOLLOW [...] ROBOTS=NOINDEX,NOFOLLOW [...]
The field being named "ROBOTS" and not "robots" leads to parseData.getMeta("robots") being null in https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257.
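To illustrate the failure mode, here is a minimal sketch that models the parse metadata as a plain case-sensitive map. RobotsMetaLookup, getIgnoreCase and isNoIndex are hypothetical names made up for this example and are not part of Nutch's API; the sketch only shows why an exact-case lookup misses the key produced by parse-tika, and how a case-insensitive lookup might avoid that.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the parse metadata; this is NOT Nutch's ParseData API.
final class RobotsMetaLookup {

    // Returns the value of the first key that matches `name` ignoring case, or null.
    static String getIgnoreCase(Map<String, String> meta, String name) {
        for (Map.Entry<String, String> entry : meta.entrySet()) {
            if (entry.getKey().equalsIgnoreCase(name)) {
                return entry.getValue();
            }
        }
        return null;
    }

    // True when the robots value contains the noindex directive, in any letter case.
    static boolean isNoIndex(String robots) {
        return robots != null
            && robots.toLowerCase(Locale.ROOT).contains("noindex");
    }

    public static void main(String[] args) {
        // Metadata as reported by parse-tika in the output above.
        Map<String, String> tikaMeta = new HashMap<>();
        tikaMeta.put("ROBOTS", "NOINDEX,NOFOLLOW");

        // Exact-case lookup misses the upper-case key.
        System.out.println(tikaMeta.get("robots")); // prints "null"

        // Case-insensitive lookup finds it, and the directive check succeeds.
        System.out.println(isNoIndex(getIgnoreCase(tikaMeta, "robots"))); // prints "true"
    }
}
```

Whether the normalization belongs in the parser (lowercasing names as parse-html does) or in the indexer-side lookup is a design choice this sketch does not settle.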