Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Won't Fix
- Affects Version: 0.8.1
- Fix Version: None
- Component: None
Description
Currently there is no way to properly limit the fetcher without regexp rules. We use the ignore.external.links option, but it seems that it doesn't work in all cases.
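For reference, this option is normally set in conf/nutch-site.xml. A sketch of the relevant entry, assuming the option meant here is Nutch's db.ignore.external.links property:

```xml
<!-- Sketch only: assumes the option in question is db.ignore.external.links. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks that point to a different host than the
  page they were found on are ignored when updating the crawl db.</description>
</property>
```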
Here are example URLs I'm seeing, even though
cat urls1 urls2 urls3 urls/urls | grep yahoo.com doesn't return any hits:
fetching http://help.yahoo.com/help/sports
fetching http://www.turkish-xxx.com/adult-traffic-trade.php
fetching http://help.yahoo.com/help/us/astr/
fetching http://www.polish-xxx.com/de-index.html
fetching http://www.driversplanet.com/Articles/Software/SpareBackup2.4.aspx
fetching http://help.yahoo.com/help/groups
fetching http://help.yahoo.com/help/fin/
fetching http://www.driversplanet.com/Articles/Software/WindowsStorageServer2003R2.aspx
fetching http://help.yahoo.com/help/us/edit/
fetching http://www.polish-xxx.com/es-index.html
Has anyone else noticed this?
I assume this has something to do with expired domains where pages are generated randomly. But still, why were URLs from other domains added at all? Maybe the urlregexp filter's catch-all +* rule needs an exclude.
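As a possible workaround, the fetch could be restricted with the regexp URL filter instead of ignore.external.links. A minimal conf/regex-urlfilter.txt sketch, assuming yahoo.com is the intended target domain (Nutch applies these rules top-down, and the first matching rule wins):

```
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# accept only yahoo.com hosts (assumed target domain for this crawl)
+^http://([a-z0-9-]+\.)*yahoo\.com/
# reject everything else (replaces a catch-all +* rule)
-.
```

With a trailing "-." instead of "+*", any URL that does not match an explicit accept rule is dropped before fetching, regardless of which page it was discovered on.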