crawl script dedup's crawldb update is slow

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.11
    • Fix Version/s: 1.17
    • Component/s: bin
    • Labels: None
    • Environment: shell

      Description

      In the standard crawl script, there is a __bin_nutch updatedb command and, soon after that, a __bin_nutch dedup command. Both of them launch Hadoop jobs with "crawldb /path/to/crawl/db" in their names; dedup launches its crawldb job in addition to the actual deduplication job.

      In my situation, the "crawldb" job launched by dedup takes twice as long as the one launched by updatedb.

      I notice that the script passes $commonOptions to updatedb but not to dedup. I suspect that the crawldb update launched by dedup may not be compressing its output.
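
The difference described above can be sketched as follows. This is only an illustration, not the script's actual contents: the value of commonOptions and the CRAWL_PATH are placeholders, and the __bin_nutch function here is a stand-in that prints the command line instead of running Nutch. The last line shows the change this report suggests trying, on the untested assumption that dedup accepts the same generic -D options as updatedb.

```shell
# Placeholder values; the real crawl script builds these from its own arguments.
CRAWL_PATH="/path/to/crawl"
commonOptions="-D mapreduce.map.output.compress=true"  # illustrative option only

# Stand-in for the script's wrapper: prints the command instead of running it.
__bin_nutch() { echo "bin/nutch $*"; }

# As reported: updatedb receives the common options...
__bin_nutch updatedb $commonOptions "$CRAWL_PATH"/crawldb

# ...but dedup does not.
__bin_nutch dedup "$CRAWL_PATH"/crawldb

# Suggested experiment: pass the same options to dedup and compare job times.
__bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb
```

If the crawldb job launched by dedup then runs about as fast as the one launched by updatedb, the missing options are the likely cause.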

            People

            • Assignee: Unassigned
            • Reporter: Michael Coffey (xoffey@gmail.com)
            • Votes: 1
            • Watchers: 3