NUTCH-2589

HTML redirections are not followed when using parse-tika

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 1.15
    • Component/s: parser, plugin
    • Labels: None

    Description

      HTML redirections using meta tags are supported in Nutch. They work well when parse-html is used to parse files. However, when parse-tika is used, they are not detected.

      This is because of https://issues.apache.org/jira/browse/TIKA-2652

      Tika emits redirection meta tags as:

      <meta name="refresh" content="0; url=http://example.com"/>
      

      whereas org.apache.nutch.parse.tika.HTMLMetaProcessor expects meta tags in the following format:

      <meta http-equiv="refresh" content="0; url=http://example.com">
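
      One possible way to tolerate both forms is to accept "refresh" in either the http-equiv or the name attribute. The sketch below only illustrates that idea with plain JDK DOM classes. RefreshMetaMatcher and its methods are hypothetical names, not the actual HTMLMetaProcessor code and not necessarily the fix that was applied.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper for illustration; not part of Nutch or Tika.
class RefreshMetaMatcher {

  /** Returns true if the element is a meta tag declaring a refresh via http-equiv or name. */
  static boolean isRefreshMeta(Element meta) {
    if (!"meta".equalsIgnoreCase(meta.getTagName())) {
      return false;
    }
    // Element.getAttribute returns an empty string when the attribute is absent.
    return "refresh".equalsIgnoreCase(meta.getAttribute("http-equiv"))
        || "refresh".equalsIgnoreCase(meta.getAttribute("name"));
  }

  /** Builds a standalone meta element with one identifying attribute and a content value. */
  static Element meta(String attr, String value, String content) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      Element e = doc.createElement("meta");
      e.setAttribute(attr, value);
      e.setAttribute("content", content);
      return e;
    } catch (ParserConfigurationException ex) {
      throw new IllegalStateException(ex);
    }
  }
}
```

      With such a check, both the tag emitted by Tika and the tag expected by HTMLMetaProcessor would be recognized as redirections, while other meta tags (for example name="description") would be ignored.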
      

      The bug can be reproduced with the following nutch-site.xml:

      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      
      <!-- Put site-specific property overrides in this file. -->
      
      <configuration>
          <property>
              <name>plugin.includes</name>
              <value>protocol-http|parse-tika</value>
          </property>
          <property>
              <name>http.agent.name</name>
              <value>blah</value>
          </property>
      </configuration>
      

      and by fetching this URL: http://www.google.com/policies/technologies/ads/

      The resulting status is

      success(1,0)

      whereas using parse-html, the resulting status is

      success(1,100), args[0]=https://policies.google.com/technologies/ads, args[1]=0
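
      The two arguments in the parse-html status appear to be the redirect target and the refresh delay, which are the two parts of the meta tag's content attribute ("0; url=..."). As a rough sketch of how such a value can be split, assuming the simple "<seconds>; url=<target>" shape shown above (RefreshContent is a hypothetical class, not Nutch code, and real-world content values can be messier):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Nutch or Tika.
class RefreshContent {

  // Matches "<seconds>; url=<target>", ignoring case and surrounding whitespace.
  private static final Pattern REFRESH = Pattern.compile(
      "^\\s*(\\d+)\\s*;\\s*url\\s*=\\s*(\\S+)\\s*$", Pattern.CASE_INSENSITIVE);

  /**
   * Splits a refresh content value into {target URL, delay in seconds}.
   * Returns null when the value does not have the expected shape.
   */
  static String[] parse(String content) {
    Matcher m = REFRESH.matcher(content);
    if (!m.matches()) {
      return null;
    }
    return new String[] { m.group(2), m.group(1) };
  }
}
```

      For the content value "0; url=http://example.com", this returns the pair ("http://example.com", "0"), the same ordering as args[0] and args[1] in the status above.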

          People

            Assignee: Unassigned
            Reporter: Gerard Bouchar (gbouchar)
            Votes: 0
            Watchers: 3