  1. Nutch
  2. NUTCH-817

parse-(html) does follow links of the full HTML page; parse-(tika) does not follow any links and stops at level 1


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None
    • Environment: Suse linux 11.1, java version "1.6.0_13"

    Description

      Submitted per Julien Nioche. I did not see where to attach a file, so I pasted it here. BTW: the Tika command line returns an empty HTML body for this file.

      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">

      <!--NewPage-->

      <HTML>

      <HEAD>

      <!-- Generated by javadoc on Fri Mar 28 17:23:42 EDT 2008-->

      <TITLE>

      Matrix Application Development Kit

      </TITLE>

      <SCRIPT type="text/javascript">

      targetPage = "" + window.location.search;

      if (targetPage != "" && targetPage != "undefined")

      targetPage = targetPage.substring(1);

      function loadFrames()

      { if (targetPage != "" && targetPage != "undefined") top.classFrame.location = top.targetPage; }

      </SCRIPT>

      <NOSCRIPT>

      </NOSCRIPT>

      </HEAD>

      <FRAMESET cols="20%,80%" title="" onLoad="top.loadFrames()">

      <FRAMESET rows="30%,70%" title="" onLoad="top.loadFrames()">

      <FRAME src="overview-frame.html" name="packageListFrame" title="All Packages">

      <FRAME src="allclasses-frame.html" name="packageFrame" title="All classes and interfaces (except non-static nested types)">

      </FRAMESET>

      <FRAME src="overview-summary.html" name="classFrame" title="Package, class and interface descriptions" scrolling="yes">

      <NOFRAMES>

      <H2>

      Frame Alert</H2>

      <P>

      This document is designed to be viewed using the frames feature. If you see this message, you are using a non-frame-capable web client.

      <BR>

      Link to<A HREF="overview-summary.html">Non-frame version.</A>

      </NOFRAMES>

      </FRAMESET>

      </HTML>
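For reference, the links a parser would need to extract from the sample above, so that a crawl can continue past level 1, sit in the FRAME `src` attributes and the one A `HREF` inside NOFRAMES. The following is only a rough sketch using Python's standard `html.parser`; it is not Nutch or Tika code. The class name `OutlinkCollector` and the shortened `SAMPLE` string are made up for illustration.

```python
from html.parser import HTMLParser

# Tags that carry link targets in the pasted sample, and the attribute to read.
LINK_ATTRS = {"frame": "src", "a": "href"}


class OutlinkCollector(HTMLParser):
    """Hypothetical helper: collects link targets from FRAME and A tags."""

    def __init__(self):
        super().__init__()
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us.
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            # Keep each target once, in document order.
            if name == wanted and value and value not in self.outlinks:
                self.outlinks.append(value)


# Shortened copy of the frameset from the description (script block omitted).
SAMPLE = """
<HTML>
<HEAD><TITLE>Matrix Application Development Kit</TITLE></HEAD>
<FRAMESET cols="20%,80%" title="">
<FRAMESET rows="30%,70%" title="">
<FRAME src="overview-frame.html" name="packageListFrame" title="All Packages">
<FRAME src="allclasses-frame.html" name="packageFrame">
</FRAMESET>
<FRAME src="overview-summary.html" name="classFrame" scrolling="yes">
<NOFRAMES>
<H2>Frame Alert</H2>
<P>
Link to<A HREF="overview-summary.html">Non-frame version.</A>
</NOFRAMES>
</FRAMESET>
</HTML>
"""

collector = OutlinkCollector()
collector.feed(SAMPLE)
print(collector.outlinks)
# ['overview-frame.html', 'allclasses-frame.html', 'overview-summary.html']
```

A parser that returns an empty body for this page would yield none of these three targets, which would match the "stops at level 1" behaviour reported here.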

      Attachments

        1. sample-javadoc.html
          1 kB
          matthew a. grisius


            People

              Assignee: Julien Nioche (jnioche)
              Reporter: matthew a. grisius (mgrisius)