NUTCH-498: Use Combiner in LinkDb to increase speed of linkdb generation

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: linkdb
    • Labels: None

      Description

      I tried to add the following combiner to LinkDb:

      public static enum Counters { COMBINED }

      public static class LinkDbCombiner extends MapReduceBase implements Reducer {
        private int _maxInlinks;

        @Override
        public void configure(JobConf job) {
          super.configure(job);
          _maxInlinks = job.getInt("db.max.inlinks", 10000);
        }

        public void reduce(WritableComparable key, Iterator values,
            OutputCollector output, Reporter reporter) throws IOException {
          // Fold all Inlinks values for this key into the first one.
          final Inlinks inlinks = (Inlinks) values.next();
          int combined = 0;
          while (values.hasNext()) {
            Inlinks val = (Inlinks) values.next();
            for (Iterator it = val.iterator(); it.hasNext();) {
              if (inlinks.size() >= _maxInlinks) {
                if (combined > 0) {
                  reporter.incrCounter(Counters.COMBINED, combined);
                }
                output.collect(key, inlinks);
                return;
              }
              Inlink in = (Inlink) it.next();
              inlinks.add(in);
            }
            combined++;
          }
          if (inlinks.size() == 0) {
            return;
          }
          if (combined > 0) {
            reporter.incrCounter(Counters.COMBINED, combined);
          }
          output.collect(key, inlinks);
        }
      }

      This greatly reduced the time it took to generate a new linkdb. In my case it reduced the time by half.

      Map output records:       8,717,810,541
      Combined:                 7,632,541,507
      Resulting output records: 1,085,269,034

      That's an 87% reduction of output records from the map phase ((8,717,810,541 - 1,085,269,034) / 8,717,810,541 ≈ 87.6%).
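
      (For illustration, a sketch of how such a combiner would be hooked into the LinkDb job; the JobConf setup shown here is assumed for the example, not copied from the attached patch:)

          // Registering the combiner when configuring the LinkDb job.
          // "config" stands for the Nutch Configuration in scope (assumed name).
          JobConf job = new JobConf(config, LinkDb.class);
          job.setReducerClass(LinkDb.class);
          job.setCombinerClass(LinkDbCombiner.class); // pre-merge Inlinks on the map side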

      1. LinkDbCombiner.patch
        0.5 kB
        Espen Amble Kolstad
      2. LinkDbCombiner.patch
        1 kB
        Espen Amble Kolstad

        Activity

        Enis Soztutar added a comment -

        I think you may not want

            reporter.incrCounter(Counters.COMBINED, combined);

        which increments the counter by the total count so far, but rather you may use

            reporter.incrCounter(Counters.COMBINED, 1);

        for each URL combined.

        Could you attach the patch against current trunk, so that we can apply it directly?
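
        (For illustration, the per-record variant described above would look roughly like this: the same reduce() as the combiner in the description, with the counter bumped once per inlink folded in. A sketch against the code above, not the attached patch:)

            public void reduce(WritableComparable key, Iterator values,
                OutputCollector output, Reporter reporter) throws IOException {
              final Inlinks inlinks = (Inlinks) values.next();
              while (values.hasNext()) {
                Inlinks val = (Inlinks) values.next();
                for (Iterator it = val.iterator(); it.hasNext();) {
                  if (inlinks.size() >= _maxInlinks) {
                    output.collect(key, inlinks);
                    return;
                  }
                  inlinks.add((Inlink) it.next());
                  // One increment per combined inlink, instead of a running total.
                  reporter.incrCounter(Counters.COMBINED, 1);
                }
              }
              if (inlinks.size() > 0) {
                output.collect(key, inlinks);
              }
            }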

        Espen Amble Kolstad added a comment -

        Here's a patch for trunk.

        I removed the Counter since it's not really useful information; it only served to show the reduction of output records.

        Doğacan Güney added a comment -

        Why can't we just set the combiner class to LinkDb? AFAICS, you are not doing anything different from LinkDb.reduce in LinkDbCombiner.reduce. A one-liner

        job.setCombinerClass(LinkDb.class);

        should do the trick, shouldn't it?

        Espen Amble Kolstad added a comment -

        Yes, you're right

        I forgot I added a new class just to get the Counter ...

        Espen Amble Kolstad added a comment -

        Made a patch for the one-liner mentioned above.

        Doğacan Güney added a comment -

        After examining the code better, I am a bit confused. We have a LinkDb.Merger.reduce and LinkDb.reduce. They both do the same thing (aggregate inlinks until their size reaches maxInlinks, then collect). Why do we have them separately? Is there a difference between them that I am missing?

        Andrzej Bialecki added a comment -

        Currently there is no difference, indeed. The version in LinkDb.reduce is safer, because it uses a separate instance of Inlinks. Perhaps we could replace LinkDb.Merger.reduce with the body of LinkDb.reduce, and completely remove LinkDb.reduce.
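
        (For illustration, the "separate instance" approach looks roughly like this; a simplified sketch of the idea, not the actual trunk code, with maxInlinks standing for the configured db.max.inlinks value:)

            public void reduce(WritableComparable key, Iterator values,
                OutputCollector output, Reporter reporter) throws IOException {
              // Accumulate into a fresh Inlinks rather than mutating the first
              // value returned by the iterator, which the framework may reuse.
              Inlinks result = new Inlinks();
              while (values.hasNext() && result.size() < maxInlinks) {
                Inlinks inlinks = (Inlinks) values.next();
                for (Iterator it = inlinks.iterator();
                     it.hasNext() && result.size() < maxInlinks;) {
                  result.add((Inlink) it.next());
                }
              }
              if (result.size() > 0) {
                output.collect(key, result);
              }
            }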

        Doğacan Güney added a comment -

        > Currently there is no difference, indeed. The version in LinkDb.reduce is safer, because it uses a separate instance of Inlinks. Perhaps we could
        > replace LinkDb.Merger.reduce with the body of LinkDb.reduce, and completely remove LinkDb.reduce.

        Sounds good. I opened NUTCH-499 for this.

        Doğacan Güney added a comment -

        I tested creating a linkdb from ~6M urls:

        Combine input records:  42,091,902
        Combine output records: 15,684,838

        (The combiner reduces the number of records to around 1/3.)

        The job took ~15 minutes overall with the combiner, ~20 minutes without.

        So, +1 from me.

        Andrzej Bialecki added a comment -

        +1.

        Sami Siren added a comment -

        +1

        Doğacan Güney added a comment -

        Committed in rev. 551147.

        Doğacan Güney added a comment -

        Issue resolved and committed.

        Hudson added a comment -

        Integrated in Nutch-Nightly #131 (See http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/131/)

          People

          • Assignee: Doğacan Güney
          • Reporter: Espen Amble Kolstad
          • Votes: 0
          • Watchers: 0
