Description
Patch attached for this.
Hi,
This is likely a question for Ferdy, but input from anyone else would
be great too. When running a crawl that I would expect to be contained
to a single domain, I'm seeing the crawler jump out to other domains.
I'm using the trunk of Nutch 2.x, which includes the following commit:
https://github.com/apache/nutch/commit/c5e2236f36a881ee7fec97aff3baf9bb32b40200
The goal is to perform a focused crawl against a single domain and
restrict the crawler from expanding beyond that domain. I've set the
db.ignore.external.links property to true. I do not want to add a
regex to regex-urlfilter.txt as I will be adding several thousand
urls. The domain that I am crawling has documents with outlinks that
are still within the domain but then redirect to external domains.
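My current understanding (a sketch of the idea, not the actual Nutch code) is that db.ignore.external.links works by comparing the host of each outlink against the host of the page it was found on, at parse time. Roughly:

```java
import java.net.URL;

public class ExternalLinkCheck {
    // Sketch only: db.ignore.external.links conceptually keeps an outlink
    // when its host matches the host of the page the link was found on.
    static boolean sameHost(String fromUrl, String toUrl) {
        try {
            return new URL(fromUrl).getHost()
                    .equalsIgnoreCase(new URL(toUrl).getHost());
        } catch (Exception e) {
            // Malformed URLs would be rejected elsewhere; treat as external here.
            return false;
        }
    }

    public static void main(String[] args) {
        // A same-host outlink passes the check...
        System.out.println(sameHost("http://www.ci.watertown.ma.us/",
                "http://www.ci.watertown.ma.us/Public_Documents/index")); // true
        // ...a cross-host outlink does not.
        System.out.println(sameHost("http://www.ci.watertown.ma.us/somepage",
                "http://www.masshome.com/tourism.html")); // false
    }
}
```

If that reading is right, a same-host outlink that the server later redirects to an external host would pass this parse-time check, and the redirect target is only discovered at fetch time, after the filter has already run — which would explain the behavior I'm seeing.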
cat urls/seed.txt
http://www.ci.watertown.ma.us/
cat conf/nutch-site.xml
...
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent
  problems with the underlying commons-httpclient library.
  </description>
</property>
...
Running
bin/nutch crawl urls -depth 8 -topN 100000
results in the crawl eventually fetching and parsing documents on
domains external to the single link in the seed.txt file.
I would not expect to see URLs like the following in my logs and in
the HBase webpage table:
fetching http://www.masshome.com/tourism.html
Parsing http://www.disabilityinfo.org/
I'm reviewing the code changes but am still getting up to speed on the
codebase. Any ideas while I continue to dig around? Is this a
configuration issue or a code issue?
Thanks,
Matt