[NUTCH-972] Mergedb doesn't merge with empty directory, as is the case with merge (for indexes) - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.2
Fix Version/s: 1.3
Component/s: storage
Labels:
- patch

Description

This is a case of unexpected behavior: the following series of commands works with bin/nutch merge (for indexes) but not with bin/nutch mergedb (for crawldbs).

# $it_crawldb is assumed to be set earlier in the script (crawl/crawldb in the run shown below).
allcrawldb="crawl/allcrawldb"
temp_crawldb="crawl/temp_crawldb"
merge_dbs="$it_crawldb $allcrawldb"

# if [[ ! -d $allcrawldb ]]
# then
#     merge_dbs="$it_crawldb"
# fi
# Uncomment the above and mergedb will work fine.
bin/nutch mergedb $temp_crawldb $merge_dbs
rm -r $it_crawldb $allcrawldb crawl/segments crawl/linkdb
mv $temp_crawldb $allcrawldb

This is the exception that occurs:

bin/nutch mergedb crawl/temp_crawldb crawl/crawldb crawl/allcrawldb
CrawlDb merge: starting at 2011-03-27 10:13:06
Adding crawl/crawldb
Adding crawl/allcrawldb
CrawlDb merge: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/simpatico/nutch-1.2/crawl/allcrawldb/current
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.CrawlDbMerger.merge(CrawlDbMerger.java:126)
at org.apache.nutch.crawl.CrawlDbMerger.run(CrawlDbMerger.java:187)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.CrawlDbMerger.main(CrawlDbMerger.java:159)

Besides the scripting workaround, I've attached a patch that skips adding the empty folder to the collection of dbs to merge. I've also added a log message showing which dbs actually get added, consistent with the merge interface.
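The scripting workaround can be generalized to any number of input dbs. The following is a minimal bash sketch, not the attached patch: `existing_dbs` is a hypothetical helper name, and the test for a `current` subdirectory is an assumption drawn from the path named in the exception above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nutch): print only those crawldb paths
# that contain a "current" subdirectory, and report skipped paths on stderr.
existing_dbs() {
  local db
  for db in "$@"; do
    if [[ -d "$db/current" ]]; then
      printf '%s\n' "$db"
    else
      echo "Skipping $db (no current/ directory)" >&2
    fi
  done
}

# Usage sketch with the paths from this report (left commented out because
# it needs a Nutch installation):
#   merge_dbs=()
#   while IFS= read -r db; do merge_dbs+=("$db"); done \
#     < <(existing_dbs crawl/crawldb crawl/allcrawldb)
#   bin/nutch mergedb crawl/temp_crawldb "${merge_dbs[@]}"
```

Collecting the surviving paths into an array rather than a space-separated string keeps paths containing spaces intact when they are passed to mergedb.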

Attachments

check_empty.diff (0.3 kB, 27/Mar/11 09:15, Gabriele Kahlout)

Issue Links

is related to: NUTCH-971 IndexMerger produces indexes itself cannot merge anymore (Closed)

People

Assignee: Unassigned

Reporter: Gabriele Kahlout

Votes: 0

Watchers: 1

Dates

Created: 27/Mar/11 09:10

Updated: 25/Jun/11 12:53

Resolved: 08/Apr/11 11:06