[NUTCH-2496] Speed up link inversion step in crawling script - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.15
Fix Version/s: 1.17
Component/s: linkdb
Labels: None

Patch Info: Patch Available

Description

While working on a project that requires indexing a very large number of URLs, I ran into a performance problem with the link inversion step of the crawling script. Some time ago, Ian Lopata ran into the same issue, as described here: http://lucene.472066.n3.nabble.com/InvertLinks-Performance-Nutch-1-6-td4183004.html

I am running the invertlinks step in my Nutch 1.6 based crawl process on a
single node. I run invertlinks only because I need the Inlinks in the
indexer step so as to store them with the document. I do not need the
anchor text and I am not scoring. I am finding that invertlinks (and more
specifically the merge of the linkdb) takes a long time - about 30 minutes
for a crawl of around 150K documents. I am looking for ways that I might
shorten this processing time. Any suggestions?

Back then, wastl-nagel suggested turning off the normalizers and filters during the inversion step, which speeds up the process considerably.
In my case, however, I depend on them, so this is not a real solution.
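
For reference, the workaround mentioned above maps to the `-noNormalize` and `-noFilter` options of the Nutch 1.x `invertlinks` command. The following is only a minimal sketch: the `crawl` directory layout, the `CRAWL_DIR` variable, and the `invertlinks_cmd` helper are assumptions for illustration and are not taken from the crawl script itself. It prints the command rather than running it.

```shell
# Hypothetical crawl directory; adjust to the actual layout.
CRAWL_DIR="${CRAWL_DIR:-crawl}"

# Hypothetical helper: builds the invertlinks command line with URL
# normalizers and filters disabled for the inversion step.
invertlinks_cmd() {
  echo "bin/nutch invertlinks $1/linkdb -dir $1/segments -noNormalize -noFilter"
}

# Print the command for inspection; run it from a Nutch runtime directory.
invertlinks_cmd "$CRAWL_DIR"
```

With both options set, links are inverted without passing through the normalizer and filter plugins, which is why this only helps when the crawl does not rely on them at this step.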

I opened this issue to get feedback on how we could improve the crawl script and speed up this step.

Issue Links

links to

GitHub Pull Request #527

People

Assignee: Lewis John McGibbney

Reporter: Moreno Feltscher

Votes: 0

Watchers: 5

Dates

Created: 12/Jan/18 23:32

Updated: 28/Jan/21 13:15

Resolved: 09/Jun/20 10:47