  1. Nutch
  2. NUTCH-2589

HTML redirections are not followed when using parse-tika

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 1.15
    • Component/s: parser, plugin
    • Labels:
      None

      Description

      HTML redirections using meta tags are supported in Nutch. They work well when parse-html is used to parse files. With parse-tika, however, they are not detected.

      This is caused by https://issues.apache.org/jira/browse/TIKA-2652

      Tika emits redirection meta tags as:

      <meta name="refresh" content="0; url=http://example.com"/>
      

      whereas org.apache.nutch.parse.tika.HTMLMetaProcessor expects meta tags in the following format:

      <meta http-equiv="refresh" content="0; url=http://example.com">
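The mismatch between the two forms can be illustrated with a small standalone sketch. This is not the Nutch HTMLMetaProcessor code and not necessarily how the issue was fixed; the class and method names (RefreshMetaCheck, isRefreshMeta) are hypothetical, and the attribute map is assumed to use lower-cased keys. It only shows a check that treats both the name form and the http-equiv form as a refresh tag.

```java
import java.util.Map;

// Hypothetical illustration, not part of Nutch or Tika.
public class RefreshMetaCheck {

    /**
     * Returns true if the meta tag attributes describe a refresh,
     * either as http-equiv="refresh" or as name="refresh".
     * Assumes attribute names in the map are already lower-cased.
     */
    static boolean isRefreshMeta(Map<String, String> attrs) {
        String httpEquiv = attrs.get("http-equiv");
        String name = attrs.get("name");
        // equalsIgnoreCase(null) is false, so missing attributes are safe.
        return "refresh".equalsIgnoreCase(httpEquiv)
            || "refresh".equalsIgnoreCase(name);
    }

    public static void main(String[] args) {
        // Form emitted by Tika, as quoted in this report.
        Map<String, String> tikaForm =
            Map.of("name", "refresh", "content", "0; url=http://example.com");
        // Form expected by the meta processor, as quoted in this report.
        Map<String, String> htmlForm =
            Map.of("http-equiv", "refresh", "content", "0; url=http://example.com");

        System.out.println(isRefreshMeta(tikaForm));
        System.out.println(isRefreshMeta(htmlForm));
    }
}
```

A check that only looks at http-equiv would miss the first form, which matches the behaviour described above.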
      

      The bug can be reproduced with the following nutch-site.xml:

      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      
      <!-- Put site-specific property overrides in this file. -->
      
      <configuration>
          <property>
              <name>plugin.includes</name>
              <value>protocol-http|parse-tika</value>
          </property>
          <property>
              <name>http.agent.name</name>
              <value>blah</value>
          </property>
      </configuration>
      

      and fetching this URL: http://www.google.com/policies/technologies/ads/

      The resulting status is

      success(1,0)

      whereas using parse-html, the resulting status is

      success(1,100), args[0]=https://policies.google.com/technologies/ads, args[1]=0
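The two args in the parse-html status appear to carry the redirect target and the refresh delay. As a rough sketch of how such values could be extracted from the content attribute of a refresh meta tag, assuming the simple "delay; url=target" shape shown in this report: the class RefreshContent and its parse method are hypothetical, and the returned array layout merely mirrors the status line above rather than any Nutch API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of Nutch or Tika.
public class RefreshContent {

    // Matches "<seconds>; url=<target>", case-insensitively, with optional spaces.
    private static final Pattern REFRESH = Pattern.compile(
        "^\\s*(\\d+)\\s*;\\s*url\\s*=\\s*(\\S+)\\s*$",
        Pattern.CASE_INSENSITIVE);

    /**
     * Returns {targetUrl, delaySeconds} for a refresh content value,
     * or null if the value does not have the assumed shape.
     */
    static String[] parse(String content) {
        if (content == null) {
            return null;
        }
        Matcher m = REFRESH.matcher(content);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(2), m.group(1) };
    }

    public static void main(String[] args) {
        String[] parsed = parse("0; url=http://example.com");
        System.out.println("args[0]=" + parsed[0] + ", args[1]=" + parsed[1]);
    }
}
```

Real-world refresh values can be quoted, omit the url= prefix, or use other separators, so a production parser would need to be more lenient than this pattern.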

            People

            • Assignee:
              Unassigned
              Reporter:
              gbouchar Gerard Bouchar