NUTCH-601: Recrawling on existing crawl directory using force option

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

      Description

      Added a '-force' option to the 'bin/nutch crawl' command line. With this option, one can crawl and recrawl in the following manner:

      bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
      bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
      

      This option can be used for the first crawl too:

      bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
      bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
      

      If one tries to crawl without the -force option when the crawl directory already exists, the command fails with an error message that points to the fix:

      # bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
      Exception in thread "main" java.lang.RuntimeException: crawl already
      exists. Add -force option to recrawl.
             at org.apache.nutch.crawl.Crawl.main(Crawl.java:89)
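
      For reference, a minimal sketch of the kind of check this error implies in Crawl.java; the variable names and surrounding code here are assumptions for illustration, not the committed patch:

      // Hypothetical sketch: fail early when the output directory
      // already exists and -force was not given.
      Path dir = new Path("crawl");              // value of the -dir argument
      FileSystem fs = FileSystem.get(conf);      // conf: the job configuration
      boolean force = false;                     // true when -force is passed
      if (fs.exists(dir) && !force) {
        throw new RuntimeException(dir + " already exists. Add -force option to recrawl.");
      }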
      
      Attachments

      1. NUTCH-601v0.1.patch (3 kB, Susam Pal)
      2. NUTCH-601v0.2.patch (2 kB, Susam Pal)
      3. NUTCH-601v0.3.patch (2 kB, Susam Pal)
      4. NUTCH-601v1.0.patch (1 kB, Susam Pal)

        Activity

        Susam Pal added a comment -

        Patch attached.

        Andrzej Bialecki added a comment -

        Thank you for creating this issue. I think the old behavior should perhaps be removed altogether: it was based on the assumption that users do not want to recrawl the same pages by default. Our experience as a community shows that this is usually not the case, and the old behavior is confusing. So why not remove this artificial limitation altogether? Users who do want to keep each cycle in a separate directory can do so by specifying different output directories.

        Susam Pal added a comment -

        Attached a revised patch (NUTCH-601v0.2.patch), which removes the old behaviour completely as per Andrzej's comment. Since the new behaviour is now the only one, the -force option is no longer needed to switch between behaviours, so it has been removed.

        Andrzej Bialecki added a comment -

        I think the section that handles the presence of an old merged index is not needed - we want to re-create it anyway, and if something bad happens it's better not to leave the old index with the new dbs/segments. So I think it's best to remove the merged index the same way as you remove the partial indexes.

        Susam Pal added a comment -

        The 'if (newIndex != index)' condition is just a check of whether this is a new crawl directory being constructed or a recrawl on a previous crawl directory.

        If it is a new crawl directory, a few lines above this check there is another condition, 'if (!fs.exists(index))', which sets newIndex = index = '/index'. So, if newIndex and index are the same, we know that it is a new crawl directory and we need not delete the old 'index', because it would not be present.

        If it is a recrawl over a previous crawl directory, newIndex = '/new_index' and index = '/index'. Since they are different, this indicates a recrawl, and thus after '/new_index' is created, it quickly replaces the old '/index'.

        This seems fine to me.

        If you want to avoid the possibility of having new segments with an old index if something bad happens, then I should delete both 'index' and 'indexes' even before the generate call. But I didn't want to delete the old 'index' so early; I was trying to minimize the time for which the 'index' directory is unavailable. This would be helpful in case someone is running a recrawl on the same 'crawl' directory which the web GUI is using to serve search results.

        Please let me know what you feel about this.

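        A rough sketch of the control flow described above, with the paths and variable names taken from the comment; this is an illustration under those assumptions, not the patch itself:

        // Illustrative sketch of the flow described in the comment above.
        Path index = new Path(dir, "index");
        Path newIndex = index;                   // new crawl: merge in place
        if (fs.exists(index)) {
          // Recrawl: build the merged index in a temporary location first.
          newIndex = new Path(dir, "new_index");
        }
        // ... generate/fetch/index cycles, then merge into newIndex ...
        if (newIndex != index) {
          // Recrawl case: swap in the fresh index, keeping the window in
          // which no 'index' directory exists as short as possible.
          fs.delete(index);
          fs.rename(newIndex, index);
        }
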
        Susam Pal added a comment -

        Attached a revised patch (NUTCH-601v0.3.patch) that makes the code simpler and easier to read.

        Susam Pal added a comment -

        Attached another patch (NUTCH-601v1.0.patch) that always deletes the old merged index, as per Andrzej's suggestion.

        The v0.4 patch would leave the old merged index with the new segments in case something went wrong during the generation of the new index. Whether the index merger fails or succeeds, we would always have an 'index' directory, so after the completion of a recrawl a user might have to verify whether the 'index' directory holds the new merged index or the old one. This may be confusing.

        However, one advantage is that one can run a recrawl on the same crawl directory which the web GUI is using to serve users: that approach minimizes the duration for which the index directory is unavailable.

        The v1.0 patch always deletes the old partial indexes as well as the old merged index. Therefore, the old index never remains once index generation has begun. If the index merger fails, there will be no 'index' directory, which is a clear indication of index generation failure. This prevents the confusion discussed above.

        Please review both patches and accept whichever the community feels is better.

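        For contrast, a sketch of the v1.0 ordering described above; again illustrative rather than the patch itself:

        // Illustrative: v1.0 removes both the partial indexes and the old
        // merged index before building the new one, so a failed merge never
        // leaves a stale 'index' directory behind.
        Path indexes = new Path(dir, "indexes");
        Path index = new Path(dir, "index");
        if (fs.exists(indexes)) fs.delete(indexes);
        if (fs.exists(index)) fs.delete(index);
        // ... index, dedup, and merge; if the merger fails, 'index' is
        // absent, which unambiguously signals the failure ...
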
        Erol added a comment -

        Hello,

        I tested this patch and so far it works. I had a few problems, but Susam helped me.

        I have one question/request. As far as I can tell, it checks the existing crawl folder and recrawls all the sites again, but is it possible to filter them out, so that only the sites we set are crawled?

        Otherwise, I think it is a very useful patch.

        Susam Pal added a comment -

        It continues the recrawl using the existing crawl directory, generating new segments from the already existing crawl/crawldb directory. You can think of a recrawl as a crawl resumed after a break (taken for generating and merging indexes). In other words, if you did two crawls with 'depth' set to 5, you have effectively done a crawl of depth 10.

        I am not clear about what you mean by filtering. Isn't conf/crawl-urlfilter.txt enough for what you want to filter in the second crawl?

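        Filtering of that kind is normally expressed as regex rules in conf/crawl-urlfilter.txt; a small example, with an illustrative domain:

        # accept URLs within the one site we want to (re)crawl
        +^http://([a-z0-9]*\.)*example.org/
        # reject everything else
        -.
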
        Andrzej Bialecki added a comment -

        Patch v. 1.0 applied to trunk in rev. 637122. Thank you!

        Hudson added a comment -

        Integrated in Nutch-trunk #390 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/390/ )

          People

          • Assignee: Andrzej Bialecki
          • Reporter: Susam Pal
          • Votes: 2
          • Watchers: 1
