Key: NUTCH-173
Type: New Feature
Status: Closed
Resolution: Fixed
Priority: Minor
Assignee: Unassigned
Reporter: Philippe EUGENE
Votes: 2
Watchers: 0
Nutch

PerHost Crawling Policy ( crawl.ignore.external.links )

Created: 13/Jan/06 06:17 PM   Updated: 19/Jul/06 05:34 PM
Component/s: fetcher
Affects Version/s: 0.7, 0.7.1, 0.8
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
patch.txt (text file, licensed for inclusion in ASF works), 2006-01-13 06:20 PM, Philippe EUGENE, 1 kB
patch08-new.patch (text file, licensed for inclusion in ASF works), 2006-05-21 01:47 AM, Stefan Neufeind, 3 kB
patch08.txt (text file, licensed for inclusion in ASF works), 2006-01-13 06:20 PM, Philippe EUGENE, 2 kB
Issue Links:
Reference
 

Resolution Date: 19/Jul/06 05:34 PM


Description
There are two major ways to crawl in Nutch.

Intranet crawl: forbid everything, allow a few hosts.

Whole-web crawl: allow everything, forbid a few things.

I propose a third type of crawl.

Directory crawl: the purpose of this crawl is to manage a few thousand hosts without maintaining pattern rules in UrlFilterRegexp.

I made two patches, for 0.7, 0.7.1, and 0.8-dev.

I propose a new boolean property in nutch-site.xml, crawl.ignore.external.links, with a default value of false.
By default, this new feature does not modify the behavior of the Nutch crawler.

When you set this property to true, the crawler does not fetch links that point outside the host of the page they were found on.
So the crawl is limited to the hosts that you inject at the beginning of the crawl.
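
The intended behavior can be sketched roughly as follows. This is only an illustration and not the code from the attached patches: the class ExternalLinkFilter and its methods hostOf and filterOutlinks are hypothetical names, and the constructor flag merely stands in for the proposed crawl.ignore.external.links property.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not part of Nutch): drops outlinks that leave the source page's host. */
final class ExternalLinkFilter {

    /** Stands in for the proposed crawl.ignore.external.links property. */
    private final boolean ignoreExternalLinks;

    ExternalLinkFilter(boolean ignoreExternalLinks) {
        this.ignoreExternalLinks = ignoreExternalLinks;
    }

    /** Returns the lower-cased host of a URL, or null if it cannot be parsed or has no host. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /**
     * With the flag off, every outlink is kept (the default behavior is unchanged).
     * With the flag on, only outlinks on the same host as fromUrl are kept;
     * links whose host cannot be determined are dropped.
     */
    List<String> filterOutlinks(String fromUrl, List<String> outlinks) {
        if (!ignoreExternalLinks) {
            return new ArrayList<>(outlinks);
        }
        String fromHost = hostOf(fromUrl);
        List<String> kept = new ArrayList<>();
        for (String link : outlinks) {
            String toHost = hostOf(link);
            if (fromHost != null && fromHost.equals(toHost)) {
                kept.add(link);
            }
        }
        return kept;
    }
}
```

Because each page only passes on links to its own host, a crawl seeded with a list of injected URLs never leaves those hosts, and no per-host pattern rule is needed.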

I know there are some proposals for a new crawl policy using the CrawlDatum in the 0.8-dev branch.
This feature could be an easy way to quickly add a new crawl feature to Nutch while waiting for a better way to improve the crawl policy.

I am posting two patches.
Sorry for my very poor English.

Philippe



Philippe EUGENE made changes - 13/Jan/06 06:20 PM
Field: Attachment | Original Value: (none) | New Value: patch.txt [ 12321919 ]
Philippe EUGENE made changes - 13/Jan/06 06:20 PM
Field: Attachment | Original Value: (none) | New Value: patch08.txt [ 12321920 ]
Stefan Neufeind made changes - 19/May/06 04:14 AM
Field: Link | Original Value: (none) | New Value: This issue relates to NUTCH-271 [ NUTCH-271 ]
Stefan Neufeind made changes - 21/May/06 01:47 AM
Field: Attachment | Original Value: (none) | New Value: patch08-new.patch [ 12334366 ]
Andrzej Bialecki made changes - 19/Jul/06 05:34 PM
Field: Resolution | Original Value: (none) | New Value: Fixed [ 1 ]
Field: Status | Original Value: Open [ 1 ] | New Value: Closed [ 6 ]