Nutch / NUTCH-2496

Speed up link inversion step in crawling script

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.15
    • Fix Version/s: 1.17
    • Component/s: linkdb
    • Labels: None
    • Patch Info: Patch Available

    Description

      While working on a project in which I have to index a huge number of URLs, I ran into a performance issue with the link inversion step of the crawling script. Some time ago, Ian Lopata hit the same issue and described it here: http://lucene.472066.n3.nabble.com/InvertLinks-Performance-Nutch-1-6-td4183004.html

      I am running the invertlinks step in my Nutch 1.6 based crawl process on a
      single node. I run invertlinks only because I need the Inlinks in the
      indexer step so as to store them with the document. I do not need the
      anchor text and I am not scoring. I am finding that invertlinks (and more
      specifically the merge of the linkdb) takes a long time - about 30 minutes
      for a crawl of around 150K documents. I am looking for ways that I might
      shorten this processing time. Any suggestions?

      Back then, wastl-nagel suggested turning off the normalizers and filters during the inversion step, which speeds up the process considerably.
      In my case, however, I depend on both, so this is not a real solution.
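
For reference, the workaround mentioned above can be sketched roughly as follows. This is a minimal sketch, not the crawl script's actual code: the variable names `NUTCH_BIN`, `CRAWL_DIR` and `SKIP_URL_CHECKS` and the helper `build_invertlinks_cmd` are hypothetical, and the directory layout (`crawl/linkdb`, `crawl/segments`) is an assumption. The `-noNormalize` and `-noFilter` options are the ones the `invertlinks` command accepts for skipping URL normalization and filtering. The sketch only prints the command, so it can be checked before being run against a real crawl.

```shell
#!/usr/bin/env bash
# Sketch: build an invertlinks command line, optionally skipping URL
# normalization and filtering. All variable names below are hypothetical
# and chosen for this example only.

NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"           # assumed path to the nutch launcher
CRAWL_DIR="${CRAWL_DIR:-crawl}"               # assumed crawl directory layout
SKIP_URL_CHECKS="${SKIP_URL_CHECKS:-true}"    # "true" disables normalizers and filters

build_invertlinks_cmd() {
  # Invert links for all segments under $CRAWL_DIR/segments into $CRAWL_DIR/linkdb.
  local cmd="$NUTCH_BIN invertlinks $CRAWL_DIR/linkdb -dir $CRAWL_DIR/segments"
  if [ "$SKIP_URL_CHECKS" = "true" ]; then
    # Skip URL normalization and filtering during link inversion.
    cmd="$cmd -noNormalize -noFilter"
  fi
  echo "$cmd"
}

# Dry run: print the command instead of executing it.
build_invertlinks_cmd
```

Setting `SKIP_URL_CHECKS=false` keeps normalizers and filters enabled, which is the case this issue is concerned with.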

      I opened this issue to get feedback on how we could improve the crawl script and speed up the link inversion step.

            People

              Assignee: Lewis John McGibbney (lewismc)
              Reporter: Moreno Feltscher (mfeltscher)
              Votes: 0
              Watchers: 5