  1. Nutch
  2. NUTCH-2379

crawl script dedup's crawldb update is slow

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.17
    • Component/s: bin
    • None
    • shell

    Description

      In the standard crawl script, there is a __bin_nutch updatedb command and, soon after it, a __bin_nutch dedup command. Each launches a Hadoop job with "crawldb /path/to/crawl/db" in its name; for dedup, this is in addition to the actual deduplication job.

      In my situation, the "crawldb" job launched by dedup takes twice as long as the one launched by updatedb.

      I notice that the script passes $commonOptions to updatedb but not to dedup. I suspect that the crawldb update launched by dedup may not be compressing its output.
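
The suspected difference can be sketched as follows. This is only an illustration of the change the report hints at, not the committed fix: the `__bin_nutch` stub merely echoes its arguments so the snippet runs without a Nutch install, and the values of `CRAWL_PATH` and `commonOptions` are assumptions chosen for the example (the real script defines its own).

```shell
#!/usr/bin/env bash
# Stub standing in for the crawl script's command wrapper (assumption:
# it forwards its arguments to bin/nutch). Here it only prints them.
__bin_nutch() {
  echo "bin/nutch $*"
}

# Assumed values for illustration; the real script sets its own.
CRAWL_PATH="/path/to/crawl"
commonOptions="-D mapreduce.map.output.compress=true"

# Behaviour described in the report: dedup receives no common options.
before=$(__bin_nutch dedup "$CRAWL_PATH"/crawldb)

# Possible change: give dedup the same options that updatedb receives.
# $commonOptions is intentionally unquoted so it splits into separate words.
after=$(__bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb)

echo "$before"
echo "$after"
```

Comparing the two echoed command lines shows the only difference is the extra `-D` option; whether that accounts for the slowdown would still need to be confirmed by timing the two jobs.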

          People

            Assignee: Unassigned
            Reporter: Michael Coffey (xoffey@gmail.com)
            Votes: 1
            Watchers: 3