Description
The DeduplicationJob ("nutch dedup") fails to install the deduplicated CrawlDb and leaves behind only the "old" CrawlDb backup (when "db.preserve.backup" is true):
% tree crawldb
crawldb
├── current
│   └── part-r-00000
│       ├── data
│       └── index
└── old
    └── part-r-00000
        ├── data
        └── index
% bin/nutch dedup crawldb
DeduplicationJob: starting at 2018-04-22 21:48:08
Deduplication: 6 documents marked as duplicates
Deduplication: Updating status of duplicate urls into crawl db.
Exception in thread "main" java.io.FileNotFoundException: File file:/tmp/crawldb/1742327020 does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
        at org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:374)
        at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:613)
        at org.apache.nutch.util.FSUtils.replace(FSUtils.java:58)
        at org.apache.nutch.crawl.CrawlDb.install(CrawlDb.java:212)
        at org.apache.nutch.crawl.CrawlDb.install(CrawlDb.java:225)
        at org.apache.nutch.crawl.DeduplicationJob.run(DeduplicationJob.java:366)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.DeduplicationJob.main(DeduplicationJob.java:379)
% tree crawldb
crawldb
└── old
    └── part-r-00000
        ├── data
        └── index
In pseudo-distributed mode it's even worse: only the "old" CrawlDb is left, and no error is reported at all.
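The stack trace points at the rename-based swap that CrawlDb.install() performs through FSUtils.replace(): the existing "current" directory is first renamed to "old", then the job's temporary output is renamed to "current". Here the second rename fails because the temporary directory (file:/tmp/crawldb/1742327020) no longer exists. Below is a minimal sketch of that sequence to show why only the backup survives; it is illustrative only, not the actual Nutch source, and the class name and install() signature are assumptions:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CrawlDbInstallSketch {

  // Swap the freshly written temp db into place, keeping a backup.
  // This mirrors the two renames visible in the stack trace
  // (FSUtils.replace called from CrawlDb.install); it is NOT the
  // actual Nutch source.
  public static void install(FileSystem fs, Path crawlDb, Path tempCrawlDb)
      throws IOException {
    Path current = new Path(crawlDb, "current");
    Path old = new Path(crawlDb, "old");

    // Rename 1: back up the existing db (current -> old). This one
    // succeeds, which is why the "old" directory survives.
    if (fs.exists(current)) {
      if (fs.exists(old)) {
        fs.delete(old, true); // drop the previous backup
      }
      fs.rename(current, old);
    }

    // Rename 2: install the new db (tempCrawlDb -> current). If the
    // temp output dir is missing, LocalFileSystem surfaces the
    // FileNotFoundException shown above (ChecksumFileSystem.rename
    // falls back to a copy, which stats the missing source).
    if (!fs.rename(tempCrawlDb, current)) {
      throw new IOException("rename failed: " + tempCrawlDb + " -> " + current);
    }
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    // Paths mirror the failing run; "1742327020" is the temp directory
    // name taken from the error message above.
    install(fs, new Path("/tmp/crawldb"), new Path("/tmp/crawldb/1742327020"));
  }
}

This would also account for the pseudo-distributed behaviour: HDFS's rename() presumably just returns false for a missing source instead of throwing, so the job completes with only the "old" backup in place and no error surfaced.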