Description
HTML redirections using meta tags are supported in Nutch. They work well when parse-html is used to parse files, but they are not detected when parse-tika is used.
This is caused by https://issues.apache.org/jira/browse/TIKA-2652
Tika emits redirection meta tags as:
<meta name="refresh" content="0; url=http://example.com"/>
whereas org.apache.nutch.parse.tika.HTMLMetaProcessor expects meta tags in the following format:
<meta http-equiv="refresh" content="0; url=http://example.com">
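A possible way to accept both forms is sketched below in Java. This is only an illustration under assumptions: the class MetaRefreshSketch and its methods refreshTarget and parse are hypothetical names, not part of Nutch or Tika, and the real HTMLMetaProcessor may need a different change. The sketch treats a meta element as a refresh tag when either its http-equiv or its name attribute equals "refresh", then extracts the URL from the content attribute.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not the actual HTMLMetaProcessor code.
class MetaRefreshSketch {

    /** Returns the refresh target URL of a meta element, or null if it is not a refresh tag. */
    static String refreshTarget(Element meta) {
        // getAttribute returns "" when the attribute is absent.
        String httpEquiv = meta.getAttribute("http-equiv").trim();
        String name = meta.getAttribute("name").trim();

        // Accept the http-equiv form as well as the name form emitted by Tika.
        boolean isRefresh = "refresh".equalsIgnoreCase(httpEquiv)
                || "refresh".equalsIgnoreCase(name);
        if (!isRefresh) {
            return null;
        }

        // The content attribute looks like "0; url=http://example.com".
        String content = meta.getAttribute("content");
        int semi = content.indexOf(';');
        if (semi < 0) {
            return null; // delay only, no redirect target
        }
        String rest = content.substring(semi + 1).trim();
        if (!rest.toLowerCase(Locale.ROOT).startsWith("url=")) {
            return null;
        }
        String url = rest.substring("url=".length()).trim();
        return url.isEmpty() ? null : url;
    }

    /** Parses a well-formed XML fragment and returns its root element (test convenience). */
    static Element parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse fragment: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String tikaForm = "<meta name=\"refresh\" content=\"0; url=http://example.com\"/>";
        String htmlForm = "<meta http-equiv=\"refresh\" content=\"0; url=http://example.com\"/>";
        System.out.println(refreshTarget(parse(tikaForm)));
        System.out.println(refreshTarget(parse(htmlForm)));
    }
}
```

The fragments in main are written as self-closing tags only so that the JDK XML parser accepts them; inside Nutch the element would come from the DOM already built by the parser plugin.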
The bug can be reproduced with the following nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|parse-tika</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>blah</value>
  </property>
</configuration>
and by fetching this URL: http://www.google.com/policies/technologies/ads/
The resulting status is
success(1,0)
whereas using parse-html, the resulting status is
success(1,100), args[0]=https://policies.google.com/technologies/ads, args[1]=0