Fetcher improvements to parse and follow outlinks up to a specified depth. The number of outlinks followed can be decreased as depth increases by means of a divisor. This patch introduces three new configuration directives:
<description>(EXPERT)When fetcher.parse is true and this value is greater than 0, the fetcher will extract outlinks
and follow them until the desired depth is reached. A value of 1 means all generated pages are fetched and their first-degree
outlinks are fetched and parsed too. Be careful: this feature is itself agnostic of the state of the CrawlDB and does not
know about already-fetched pages. A setting larger than 2 will most likely fetch home pages twice in the same fetch cycle.
It is highly recommended to set db.ignore.external.links to true to restrict the outlink follower to URLs within the same
domain. When disabled (false), the feature is likely to follow duplicates even when depth=1.
A value of -1 or 0 disables this feature.
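A minimal nutch-site.xml fragment enabling the feature could look as follows; the property names are those introduced by this patch, and the values are only illustrative:

```xml
<!-- Illustrative values: follow first-degree outlinks only,
     and keep the outlink follower within the same domain. -->
<property>
  <name>fetcher.follow.outlinks.depth</name>
  <value>1</value>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
```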
<description>(EXPERT)The number of outlinks to follow when fetcher.follow.outlinks.depth is enabled. Be careful, as this can multiply
the total number of pages to fetch. This works together with fetcher.follow.outlinks.depth.divisor: with default settings, the number
of outlinks followed at depth 1 is 8, not 4.
<description>(EXPERT)The divisor of fetcher.follow.outlinks.num.links per fetcher.follow.outlinks.depth. This decreases the number
of outlinks to follow as depth increases. The formula used is: outlinks = floor(divisor / depth * num.links). This prevents
exponential growth of the fetch list.
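The formula above can be sketched as a small Java snippet; the class and method names here are hypothetical, and the defaults assumed are num.links=4 and divisor=2, matching the "8, not 4" remark for depth 1:

```java
// Sketch of the documented formula: outlinks = floor(divisor / depth * num.links).
// Class and method names are illustrative, not part of the patch.
public class OutlinkBudget {

    // Number of outlinks to follow at a given depth.
    static int outlinksAtDepth(int divisor, int depth, int numLinks) {
        return (int) Math.floor((double) divisor / depth * numLinks);
    }

    public static void main(String[] args) {
        // Assumed defaults: divisor = 2, num.links = 4.
        for (int depth = 1; depth <= 3; depth++) {
            System.out.println("depth " + depth + ": "
                + outlinksAtDepth(2, depth, 4) + " outlinks");
        }
    }
}
```

With these assumed defaults the budget shrinks as the crawl goes deeper (8 at depth 1, 4 at depth 2, 2 at depth 3), which is what prevents the fetch list from growing exponentially.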
Please do not use this unless you know what you are doing. This feature does not consider the state of the CrawlDB, nor does it consider generator settings such as limiting the number of pages per (domain|host|ip) queue. It is not polite to use this feature with high settings, as it can fetch many pages from the same domain, including duplicates.
Also, this feature will not work if fetcher.parse is disabled. With parsing enabled, you might want to consider not storing the downloaded content.
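For the last point, a hedged nutch-site.xml sketch using the existing fetcher.parse and fetcher.store.content properties (values illustrative):

```xml
<!-- Parsing must be done in the fetcher for outlink following to work;
     skipping content storage keeps segment size down. -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
</property>
<property>
  <name>fetcher.store.content</name>
  <value>false</value>
</property>
```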