[NUTCH-2379] crawl script dedup's crawldb update is slow - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.11
Fix Version/s: 1.17
Component/s: bin
Labels: None
Environment: shell

Description

In the standard crawl script, there is a _bin_nutch updatedb command followed shortly by a _bin_nutch dedup command. Each of them launches a Hadoop job with "crawldb /path/to/crawl/db" in its name; dedup launches this job in addition to the actual deduplication job.

In my situation, the "crawldb" job launched by dedup takes twice as long as the one launched by updatedb.

I notice that the script passes $commonOptions to updatedb but not to dedup. I suspect that the crawldb update launched by dedup may not be compressing its output.
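
The suspected difference can be illustrated with a minimal sketch. It assumes that dedup accepts the same Hadoop -D generic options as updatedb, which is not confirmed here. The _bin_nutch function below is a stub that only prints the command it would run, and the value of commonOptions and the crawl path are hypothetical stand-ins, not the script's actual settings.

```shell
#!/usr/bin/env bash
# Stub standing in for the crawl script's helper: it only prints the
# command line, so the sketch runs without a Nutch installation.
_bin_nutch() {
  echo "bin/nutch $*"
}

CRAWL_PATH="/path/to/crawl"   # hypothetical crawl directory

# Assumed option set; the real script's $commonOptions may differ.
commonOptions="-D mapreduce.map.output.compress=true -D mapreduce.job.reduces=2"

# updatedb already receives the shared options (segment argument omitted).
# $commonOptions is left unquoted on purpose so it splits into separate words.
updatedb_cmd=$(_bin_nutch updatedb $commonOptions "$CRAWL_PATH/crawldb")

# Proposed change: pass the same shared options to dedup as well.
dedup_cmd=$(_bin_nutch dedup $commonOptions "$CRAWL_PATH/crawldb")

echo "$updatedb_cmd"
echo "$dedup_cmd"
```

With both commands receiving the same options, any remaining difference in run time between the two "crawldb" jobs would have to come from something other than the compression and reducer settings.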

People

Assignee: Unassigned

Reporter: Michael Coffey

Votes: 1

Watchers: 3

Dates

Created: 01/May/17 22:52

Updated: 28/Jan/21 13:16

Resolved: 28/Apr/20 08:58