
NUTCH-1468: Redirects that are external links not adhering to db.ignore.external.links


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: 2.1
    • Component/s: fetcher
    • Labels: None
    • Patch Info: Patch Available

    Description

      Patch attached for this.

      Hi,

      Likely this is a question for Ferdy, but if anyone else has input
      that would be great. When running a crawl that I would expect to be
      contained to a single domain, I'm seeing the crawler jump out to other
      domains. I'm using the trunk of Nutch 2.x, which includes the following
      commit: https://github.com/apache/nutch/commit/c5e2236f36a881ee7fec97aff3baf9bb32b40200

      The goal is to perform a focused crawl against a single domain and
      restrict the crawler from expanding beyond that domain. I've set the
      db.ignore.external.links property to true. I do not want to add a
      regex to regex-urlfilter.txt, as I will be adding several thousand
      URLs. The domain that I am crawling has documents with outlinks that
      are still within the domain but then redirect to external domains.
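
      To illustrate why such outlinks get through: as far as I can tell,
      db.ignore.external.links only compares the host of the outlink itself
      against the host of the page it was found on, so a same-host link is
      kept even if the server later redirects it to another domain. A minimal
      standalone sketch of that kind of host check (my own illustration, not
      Nutch code; the same-host path below is made up):

      import java.net.URL;

      // Standalone illustration of the host comparison implied by
      // db.ignore.external.links; not the actual Nutch implementation.
      public class ExternalLinkCheck {
        static boolean isExternal(String fromUrl, String outlink) throws Exception {
          String fromHost = new URL(fromUrl).getHost().toLowerCase();
          String outHost = new URL(outlink).getHost().toLowerCase();
          return !fromHost.equals(outHost);
        }

        public static void main(String[] args) throws Exception {
          // A same-host outlink is kept even if fetching it later redirects
          // to another domain; only the literal outlink host is checked.
          System.out.println(isExternal("http://www.ci.watertown.ma.us/",
              "http://www.ci.watertown.ma.us/somepage.html"));   // false -> kept
          System.out.println(isExternal("http://www.ci.watertown.ma.us/",
              "http://www.masshome.com/tourism.html"));          // true  -> dropped
        }
      }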

      cat urls/seed.txt
      http://www.ci.watertown.ma.us/

      cat conf/nutch-site.xml
      ...
      <property>
        <name>db.ignore.external.links</name>
        <value>true</value>
        <description>If true, outlinks leading from a page to external hosts
        will be ignored. This is an effective way to limit the crawl to include
        only initially injected hosts, without creating complex URLFilters.
        </description>
      </property>

      <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
        <description>Regular expression naming plugin directory names to
        include. Any plugin not matching this expression is excluded.
        In any case you need at least include the nutch-extensionpoints plugin. By
        default Nutch includes crawling just HTML and plain text via HTTP,
        and basic indexing and search plugins. In order to use HTTPS please enable
        protocol-httpclient, but be aware of possible intermittent problems with
        the underlying commons-httpclient library.
        </description>
      </property>
      ...

      Running
      bin/nutch crawl urls -depth 8 -topN 100000

      results in the crawl eventually fetching and parsing documents on
      domains external to the only link in the seed.txt file.

      I would not expect to see URLs like the following in my logs and in
      the HBase webpage table:

      fetching http://www.masshome.com/tourism.html
      Parsing http://www.disabilityinfo.org/

      I'm reviewing the code changes but am still getting up to speed on the
      code base. Any ideas while I continue to dig around? Is this a
      configuration issue, or a code issue?
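
      For what it's worth, the idea behind the patch I attached is roughly to
      apply that same host check at the point where the fetcher follows a
      redirect. A rough sketch of the kind of guard I have in mind, with a
      hypothetical filterRedirect helper rather than the real fetcher code:

      import java.net.URL;

      // Rough sketch only; the method name and wiring are hypothetical and do
      // not reflect the actual Nutch fetcher code or the attached patch.
      class RedirectGuard {
        /** Returns the redirect target if it stays on the original host, or
         *  null to drop it when db.ignore.external.links is true. */
        static String filterRedirect(String originalUrl, String redirectUrl,
            boolean ignoreExternalLinks) throws Exception {
          if (!ignoreExternalLinks) {
            return redirectUrl;
          }
          String origHost = new URL(originalUrl).getHost().toLowerCase();
          String destHost = new URL(redirectUrl).getHost().toLowerCase();
          return origHost.equals(destHost) ? redirectUrl : null;
        }
      }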

      Thanks,
      Matt

      Attachments

        1. redirects-to-external.patch (2 kB, Matt MacDonald)


          People

            Assignee: Unassigned
            Reporter: Matt MacDonald (driki)
            Votes: 0
            Watchers: 4
