[NUTCH-1772] Injector does not need merging if no pre-existing crawldb - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.8
Fix Version/s: 1.9
Component/s: injector
Labels: None
Patch Info: Patch Available

Description

The injector currently works as follows:

MapReduce job 1 - Mapper: converts input lines into CrawlDatum objects, with normalisation and filtering
MapReduce job 1 - Reducer: identity reducer; the output can still contain duplicates at this stage
MapReduce job 2 - Mapper: CrawlDbFilter on the existing crawldb (if any) + output of the previous job
MapReduce job 2 - Reducer: deduplication
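
The two-job flow listed above can be illustrated with a minimal plain-Python simulation. This is only a sketch and not Nutch code: `normalize_and_filter`, `job1`, `job2`, the `(url, status)` record shape, and the rule that an existing entry wins over an injected one are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (url, status): a simplified stand-in for a URL -> CrawlDatum pair.
Record = Tuple[str, str]


def normalize_and_filter(line: str) -> Optional[str]:
    """Hypothetical stand-in for URL normalisation and filtering."""
    url = line.strip()
    if not url or url.startswith("#"):
        return None  # skip blank lines and comments
    if not url.lower().startswith(("http://", "https://")):
        return None  # filter out unsupported schemes
    return url.rstrip("/")


def job1(seed_lines: Iterable[str]) -> List[Record]:
    """Job 1: the mapper builds records; the identity reducer drops nothing."""
    mapped = []
    for line in seed_lines:
        url = normalize_and_filter(line)
        if url is not None:
            mapped.append((url, "injected"))
    # The shuffle sorts by key, but duplicates survive the identity reducer.
    return sorted(mapped)


def job2(existing: List[Record], injected: List[Record]) -> Dict[str, str]:
    """Job 2: read both inputs, then keep one record per URL in the reducer."""
    merged: Dict[str, str] = {}
    # Assumed policy for this sketch: the first record seen wins, and
    # existing crawldb entries are read before injected ones.
    for url, status in existing + injected:
        merged.setdefault(url, status)
    return merged


seeds = ["http://a.com/", "http://a.com", "# comment", "ftp://x", "http://b.com"]
after_job1 = job1(seeds)
crawldb = job2([("http://b.com", "fetched")], after_job1)
```

In this simulation `after_job1` still holds `http://a.com` twice, and only `job2` reduces the data to one record per URL, which is the redundancy the issue describes.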

If there is no existing crawldb (which will often be the case at injection time), the second MapReduce job is not needed: the output of job 1 can be used directly as the crawldb, provided that deduplication is done in its reduce step.
If there is an existing crawldb, the reduce step of job 1 is not needed and that job can run map-only.
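
The proposal amounts to choosing the job layout based on whether a crawldb already exists. A minimal sketch of that decision follows. The function `plan_injection` and the stage labels are hypothetical names used for illustration; they are not taken from the attached patch.

```python
from typing import List, Tuple

# (job, mapper, reducer): a descriptive label for each stage.
Stage = Tuple[str, str, str]


def plan_injection(crawldb_exists: bool) -> List[Stage]:
    """Return the stages the proposed injector flow would run."""
    if not crawldb_exists:
        # One job suffices: deduplicate in its reducer and use the
        # output directly as the crawldb.
        return [("job 1", "normalise + filter", "deduplicate")]
    # An existing crawldb must be merged, so job 1 can be map-only and
    # deduplication happens once, in the reducer of job 2.
    return [
        ("job 1", "normalise + filter", "none (map-only)"),
        ("job 2", "CrawlDbFilter on crawldb + job 1 output", "deduplicate"),
    ]


fresh_plan = plan_injection(crawldb_exists=False)
merge_plan = plan_injection(crawldb_exists=True)
```

Either way, deduplication runs exactly once and no reduce phase is spent on an identity pass.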

Attachments

NUTCH-1772-Logging&ErrorHandling.patch, 14/May/14 22:40, 26 kB, Diaa
NUTCH-1772.patch, 12/May/14 15:32, 25 kB, Julien Nioche

Issue Links

duplicates: NUTCH-1712 Use MultipleInputs in Injector to make it a single mapreduce job (Closed)

People

Assignee: Unassigned

Reporter: Julien Nioche

Votes: 1

Watchers: 5

Dates

Created: 12/May/14 15:30

Updated: 13/Mar/24 14:51

Resolved: 16/May/14 07:59