Description
In the standard crawl script, there is a _bin_nutch updatedb command and, soon after that, a _bin_nutch dedup command. Both of them launch Hadoop jobs with "crawldb /path/to/crawl/db" in their names (for dedup, this job is in addition to the actual deduplication job).
In my situation, the "crawldb" job launched by dedup takes twice as long as the one launched by updatedb.
I notice that the script passes $commonOptions to updatedb but not to dedup. I suspect, though I have not verified it, that the crawldb update launched by dedup is therefore not compressing its output, which would explain the longer run time.
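
A minimal sketch of the change being suggested, written as a dry run so it can be tried without a Nutch installation. The value of commonOptions below is a hypothetical stand-in (the real script defines its own set of -D options), the run helper is illustrative only, and it is an assumption that dedup accepts Hadoop generic -D options the same way updatedb does.

```shell
#!/bin/sh
# Hypothetical stand-in; the real crawl script sets its own commonOptions.
commonOptions="-D mapreduce.map.output.compress=true"
CRAWL_DB="/path/to/crawl/db"

# Dry-run helper (illustrative): print the command instead of executing it.
run() { echo "$@"; }

# Current behaviour as described above: dedup is called without $commonOptions.
run bin/nutch dedup "$CRAWL_DB"

# Proposed behaviour: pass the same options that updatedb receives.
# $commonOptions is left unquoted on purpose so it splits into separate arguments.
run bin/nutch dedup $commonOptions "$CRAWL_DB"
```

If the second form makes the "crawldb" job launched by dedup run in roughly the same time as the one launched by updatedb, that would support the compression theory.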