Description
Just an issue of unexpected behavior. The following series of commands works with bin/nutch merge to merge indexes, but the same approach does not work with bin/nutch mergedb to merge crawldbs.
allcrawldb="crawl/allcrawldb"
temp_crawldb="crawl/temp_crawldb"
merge_dbs="$it_crawldb $allcrawldb"
# if [[ ! -d $allcrawldb ]]
# then
#   merge_dbs="$it_crawldb"
# fi
# uncomment the above and mergedb will work fine.
bin/nutch mergedb $temp_crawldb $merge_dbs
rm -r $it_crawldb $allcrawldb crawl/segments crawl/linkdb
mv $temp_crawldb $allcrawldb
This is the exception that occurs:
bin/nutch mergedb crawl/temp_crawldb crawl/crawldb crawl/allcrawldb
CrawlDb merge: starting at 2011-03-27 10:13:06
Adding crawl/crawldb
Adding crawl/allcrawldb
CrawlDb merge: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/simpatico/nutch-1.2/crawl/allcrawldb/current
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.CrawlDbMerger.merge(CrawlDbMerger.java:126)
at org.apache.nutch.crawl.CrawlDbMerger.run(CrawlDbMerger.java:187)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.CrawlDbMerger.main(CrawlDbMerger.java:159)
Besides the scripting workaround, I've attached a patch that skips adding the empty folder to the collection of dbs to merge. I've also added a log of which dbs actually get added, consistent with the merge interface.
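For anyone who cannot apply the patch, the same idea can be approximated in the calling script. The sketch below is not the attached patch: `existing_crawldbs` is a hypothetical helper name, and it assumes that a usable crawldb is one containing a `current` subdirectory, which is the path the exception above reports as missing. It logs each decision to stderr and prints only the usable dbs on stdout.

```shell
# Hypothetical helper (not part of Nutch): print only the crawldb
# directories that contain a "current" subdirectory, and log to stderr
# which ones are added and which are skipped.
# The body runs in a subshell so the loop variable does not leak.
existing_crawldbs() (
  for db in "$@"; do
    if [ -d "$db/current" ]; then
      echo "Adding $db" >&2
      printf '%s\n' "$db"
    else
      echo "Skipping $db (no current/ directory)" >&2
    fi
  done
)

# Possible usage with the variables from the script above
# (assumes the paths contain no whitespace):
#   merge_dbs=$(existing_crawldbs "$it_crawldb" "$allcrawldb")
#   bin/nutch mergedb "$temp_crawldb" $merge_dbs
```

Filtering on `current` rather than on the top-level directory also covers the case where the folder exists but is empty, which a plain `-d $allcrawldb` check would let through.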
Attachments
Issue Links
- is related to NUTCH-971 IndexMerger produces indexes itself cannot merge anymore (Closed)